ema (Emanuele Rocca)
Staff Site Reliability Engineer, Traffic Team · Administrator

User Details

User Since
Sep 29 2015, 8:49 PM (269 w, 5 d)
Roles
Administrator
Availability
Available
IRC Nick
ema
LDAP User
Ema
MediaWiki User
Unknown

Recent Activity

Today

ema closed T268883: fifo-log-tailer: gracefully handle missing unix socket as Resolved.
root@cp4028:~# fifo-log-tailer -socket this-does-not-exist-at-all.socket
2020/11/30 16:42:38 Unable to read from socket: dial unix this-does-not-exist-at-all.socket: connect: no such file or directory
2020/11/30 16:42:39 Unable to read from socket: dial unix this-does-not-exist-at-all.socket: connect: no such file or directory
[...]
2020/11/30 16:42:48 Could not connect to this-does-not-exist-at-all.socket after 10 attempts. Exiting.
Mon, Nov 30, 4:43 PM · Operations, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Varnish 6.0.7 is behaving well in terms of functionality on cp4032 (T268736).

Mon, Nov 30, 2:15 PM · Performance-Team (Radar), Operations, Traffic
ema closed T256467: Make atsmtail-backend.service depend on fifo-log-demux as Resolved.

Unit ordering at boot time is now correct:

Mon, Nov 30, 10:34 AM · Operations, Traffic

Fri, Nov 27

ema moved T268883: fifo-log-tailer: gracefully handle missing unix socket from Triage to Bug Reports on the Traffic board.
Fri, Nov 27, 10:50 AM · Operations, Traffic
ema triaged T268883: fifo-log-tailer: gracefully handle missing unix socket as Low priority.
Fri, Nov 27, 10:50 AM · Operations, Traffic
ema created T268883: fifo-log-tailer: gracefully handle missing unix socket.
Fri, Nov 27, 10:50 AM · Operations, Traffic
ema added a comment to T265625: ats-be occasional system CPU usage increase.

This happened again last night at 2020-11-27T00:08, we had alerts on cp1089, cp1077, cp1087, cp1083 and cp1075 in eqiad, cp2029 (codfw), cp3062 and cp3064 (esams), and cp5009 (eqsin):

Fri, Nov 27, 10:09 AM · Operations, Traffic

Thu, Nov 26

ema closed T256302: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error as Invalid.

Timing out given that 5 months have passed since this issue was reported and, to the best of my knowledge, it was an isolated case. Feel free to reopen if it happens again.

Thu, Nov 26, 3:05 PM · Operations, Traffic
ema updated subscribers of T268736: Package and deploy varnish 6.0.7.

@Gilles: FYI during the next few weeks we'll be upgrading to this latest bugfix release. The list of changes (see task description) does not seem to suggest anything that could have an obvious performance impact, but you never know. I am going to upgrade one single node first, see how it behaves for a while and then proceed with the rest.

Thu, Nov 26, 9:11 AM · Traffic, Operations
ema moved T268736: Package and deploy varnish 6.0.7 from Triage to Feature Requests on the Traffic board.
Thu, Nov 26, 9:06 AM · Traffic, Operations
ema added a comment to T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

Can you look it over and make sure that everything looks correct before I announce on Twitter?

Thu, Nov 26, 8:46 AM · Traffic, Technical-blog-posts, Operations

Wed, Nov 25

ema added a comment to T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

I decided on this image of a road: https://commons.wikimedia.org/wiki/File:On_the_road,_Death_Valley_(23702938504).jpg, but if you have a different image you prefer, let me know!

Wed, Nov 25, 3:45 PM · Traffic, Technical-blog-posts, Operations
ema added a comment to T268736: Package and deploy varnish 6.0.7.

I've tried building 6.0.7 on my workstation to double-check the changes between 6.0.6 and 6.0.7 with debdiff. When running the tests, ./bin/varnishtest/tests/m00035.vtc fails with a segmentation fault:

Wed, Nov 25, 2:01 PM · Traffic, Operations
ema added a comment to T264378: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working.

So, should we try to prevent MediaWiki from emitting cookies in 304 responses, or does it not really matter? IIUC it will result in misleading logs but is otherwise harmless.

Wed, Nov 25, 12:15 PM · Operations, Traffic
ema added a comment to T268736: Package and deploy varnish 6.0.7.

I've tried building 6.0.7 on my workstation to double-check the changes between 6.0.6 and 6.0.7 with debdiff. When running the tests, ./bin/varnishtest/tests/m00035.vtc fails with a segmentation fault:

Wed, Nov 25, 11:30 AM · Traffic, Operations
ema created P13407 m00035.log.
Wed, Nov 25, 11:27 AM
ema triaged T268736: Package and deploy varnish 6.0.7 as Medium priority.
Wed, Nov 25, 11:10 AM · Traffic, Operations
ema created T268736: Package and deploy varnish 6.0.7.
Wed, Nov 25, 11:10 AM · Traffic, Operations

Tue, Nov 24

ema moved T263788: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports from Triage to Feature Requests on the Traffic board.
Tue, Nov 24, 3:22 PM · Operations, Traffic
ema moved T263797: Switch to Maglev hashing ('mh') on LVS hosts from Triage to Feature Requests on the Traffic board.
Tue, Nov 24, 3:22 PM · Operations, Traffic
ema moved T263829: cloudweb2001-dev: add TLS termination from Triage to Feature Requests on the Traffic board.
Tue, Nov 24, 3:19 PM · HTTPS, cloud-services-team (Kanban), Operations, Cloud-Services, Traffic
ema moved T263830: contint.wikimedia.org: add TLS termination from Triage to Feature Requests on the Traffic board.
Tue, Nov 24, 3:19 PM · HTTPS, Continuous-Integration-Infrastructure, Traffic, Operations
ema moved T263831: puppetmaster[12]001: add TLS termination from Triage to Feature Requests on the Traffic board.
Tue, Nov 24, 3:19 PM · HTTPS, Operations, serviceops, Traffic
ema moved T267435: Beta cluster seems to be extremely slow for logged in user during page navigation from Triage to Watching on the Traffic board.
Tue, Nov 24, 3:18 PM · Release-Engineering-Team-TODO, Traffic, Operations, Beta-Cluster-Infrastructure
ema closed T268243: Broken package state on cp4032 as Resolved.

Mentioned in SAL (#wikimedia-operations) [2020-11-24T09:13:19Z] <ema> cp4032: switch back to varnish 6.0.6-1wm2 after T264398 experiment, fix T268243

Tue, Nov 24, 3:18 PM · Traffic, Operations
ema closed T264378: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working as Resolved.

<full_speculation_mode>

  1. the history API responds with Cache-Control: max-age=60 and no session cookie, thus no logging of "cacheable cookie" at the ATS level. The object gets cached by both ATS and Varnish
  2. Upon revalidation, Varnish sends a conditional request to ATS backend, which either has no cached object anymore or also has to revalidate from the origin
  3. The origin responds with a session cookie and no Cache-Control
  4. Varnish does the logging (on conditional requests, beresp isn't simply the current backend response as any sane person would think, it also includes all headers originally cached)
Tue, Nov 24, 2:14 PM · Operations, Traffic
ema edited P13386 36-set-cookie-logging-false-positives.vtc.log.
Tue, Nov 24, 1:41 PM
ema created P13386 36-set-cookie-logging-false-positives.vtc.log.
Tue, Nov 24, 1:33 PM
ema closed Restricted Task, a subtask of T264370: User authentication security issue (Oct 1), as Invalid.
Tue, Nov 24, 9:33 AM · Wikimedia-General-or-Unknown, Security
ema added a comment to T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

@ema do you have any preference for a featured image?

Tue, Nov 24, 8:51 AM · Traffic, Technical-blog-posts, Operations

Mon, Nov 23

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

We should probably wait to have at least a full week of post-warmup data for confirmation. This early result suggests that 6.0.4 might be doing a little better, but definitely not "5.1.3 better"...

Mon, Nov 23, 5:02 PM · Performance-Team (Radar), Operations, Traffic
ema added a project to T268243: Broken package state on cp4032: Traffic.
Mon, Nov 23, 4:18 PM · Traffic, Operations
ema added a comment to T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

Take a look at the suggestions and accept or decline what you like.

Mon, Nov 23, 3:39 PM · Traffic, Technical-blog-posts, Operations
ema added a comment to T267435: Beta cluster seems to be extremely slow for logged in user during page navigation.

@ema I noticed you'd done some work on the deployment-cache-text06 layers at some point -- would you have time to take a look at what might be causing routing slowness there?

Mon, Nov 23, 12:45 PM · Release-Engineering-Team-TODO, Traffic, Operations, Beta-Cluster-Infrastructure

Fri, Nov 6

ema added a comment to T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network.

@ema just checking in on this. Do you have a draft you are currently working on?

Fri, Nov 6, 9:58 AM · Traffic, Technical-blog-posts, Operations

Thu, Nov 5

ema added a comment to T233474: Ensure graphs used by Performance account for Varnish-to-ATS migration.

@Krinkle: anything left TBD here?

Thu, Nov 5, 8:50 AM · Traffic, Performance-Team, Operations, observability
ema placed T209590: HTTP/2 requests fail with too-long URLs up for grabs.
Thu, Nov 5, 8:47 AM · Traffic, Operations

Oct 30 2020

ema added a comment to T266791: Requesting access to production shell groups for DNdubane.

I couldn't find @DNdubane_WMF's signature on L3, task description updated accordingly.

Oct 30 2020, 11:15 AM · Operations, SRE-Access-Requests
ema updated the task description for T266791: Requesting access to production shell groups for DNdubane.
Oct 30 2020, 11:13 AM · Operations, SRE-Access-Requests
ema triaged T266791: Requesting access to production shell groups for DNdubane as Medium priority.
Oct 30 2020, 11:07 AM · Operations, SRE-Access-Requests
ema added a comment to T266718: Requesting access to prod cluster for annet.

@AnneT: please let us know if everything is working as expected!

Oct 30 2020, 10:53 AM · Operations, SRE-Access-Requests
ema updated the task description for T266718: Requesting access to prod cluster for annet.
Oct 30 2020, 10:52 AM · Operations, SRE-Access-Requests
ema moved T266040: Large text objects are randomized to cache backends from Triage to Bug Reports on the Traffic board.
Oct 30 2020, 9:59 AM · Patch-For-Review, Operations, Traffic
ema moved T266651: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion from Triage to Bug Reports on the Traffic board.
Oct 30 2020, 9:59 AM · Operations, Traffic
ema moved T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 from Triage to Bug Reports on the Traffic board.
Oct 30 2020, 9:59 AM · Operations, Traffic
ema moved T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network from Triage to Caching on the Traffic board.
Oct 30 2020, 9:57 AM · Traffic, Technical-blog-posts, Operations
ema triaged T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network as Medium priority.
Oct 30 2020, 9:57 AM · Traffic, Technical-blog-posts, Operations
ema created T266857: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network.
Oct 30 2020, 9:56 AM · Traffic, Technical-blog-posts, Operations

Oct 29 2020

ema closed T266498: New prod ssh key for calbon as Resolved.
Oct 29 2020, 3:50 PM · Operations, SRE-Access-Requests
ema added a comment to T266498: New prod ssh key for calbon.

@calbon: please let me know if you now have access and we can close this. Thanks!

Oct 29 2020, 3:24 PM · Operations, SRE-Access-Requests

Oct 28 2020

ema triaged T266651: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion as High priority.
Oct 28 2020, 10:38 AM · Operations, Traffic
ema created T266651: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion.
Oct 28 2020, 10:38 AM · Operations, Traffic
ema added a comment to T266498: New prod ssh key for calbon.

I've pinged @calbon on Google Chat asking to confirm the public key, taking care of the puppet change once I hear from him.

Oct 28 2020, 9:52 AM · Operations, SRE-Access-Requests
ema triaged T266498: New prod ssh key for calbon as Medium priority.
Oct 28 2020, 9:24 AM · Operations, SRE-Access-Requests
ema added a comment to T263683: Mechanism to flag webrequests as "debug".

@Millimetric, after discussing with @ema, traffic feels that those requests should be visible in turnilo (eg webrequests_sampled_128), but we should be able to filter them out easily.

Oct 28 2020, 9:16 AM · Patch-For-Review, serviceops, Analytics-Kanban, Analytics, User-jijiki

Oct 27 2020

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

With T266567 out of the way, we can now try different Varnish 6 versions, at least as long as they're VRT-compatible.

Oct 27 2020, 3:30 PM · Performance-Team (Radar), Operations, Traffic
ema closed T266567: libvmod-netmapper: must specify ABI stanza as Resolved.

Done in libvmod-netmapper 1.9-1, closing.

Oct 27 2020, 3:28 PM · Operations, Traffic
ema triaged T266567: libvmod-netmapper: must specify ABI stanza as Medium priority.
Oct 27 2020, 2:06 PM · Operations, Traffic
ema created T266567: libvmod-netmapper: must specify ABI stanza.
Oct 27 2020, 2:06 PM · Operations, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Given that the amount of changes between 5.1.3 and 6.0.6 is considerable, I was thinking of following this "bisect-like" approach: package Varnish 6.0.2, try it out on a node, see if there's any difference. If 6.0.2 performs better, then the regression happened between 6.0.3 and 6.0.6, otherwise earlier than 6.0.2.

Oct 27 2020, 10:21 AM · Performance-Team (Radar), Operations, Traffic
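
The "bisect-like" approach in the comment above generalises to a binary search over an ordered list of builds. The Go sketch below assumes the regression, once introduced, is present in every later build. The build labels and the `regressed` predicate are placeholders: in practice each call would mean packaging a version, running it on a node, and comparing measurements.

```go
package main

import (
	"fmt"
	"sort"
)

// firstRegressed returns the earliest entry in versions (ordered oldest to
// newest) for which regressed reports true. It assumes that every entry
// after the first regressed one is regressed as well.
func firstRegressed(versions []string, regressed func(version string) bool) (string, bool) {
	i := sort.Search(len(versions), func(i int) bool { return regressed(versions[i]) })
	if i == len(versions) {
		return "", false
	}
	return versions[i], true
}

func main() {
	// Placeholder build labels, not real Varnish releases.
	versions := []string{"build-0", "build-1", "build-2", "build-3", "build-4", "build-5"}

	// Placeholder predicate, not a measurement.
	index := map[string]int{}
	for i, v := range versions {
		index[v] = i
	}
	regressed := func(v string) bool { return index[v] >= 3 }

	if v, ok := firstRegressed(versions, regressed); ok {
		fmt.Println("first regressed build:", v)
	}
}
```

With n candidate builds this needs about log2(n) packaging and test rounds instead of n.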

Oct 26 2020

ema added a comment to T265911: ATS trying to set socket options SO_MARK / IP_TOS.

Mentioned in SAL (#wikimedia-operations) [2020-10-26T11:11:10Z] <vgutierrez> upgrade trafficserver to 8.0.8-1wm3 on cp4032 - T265911

Oct 26 2020, 12:32 PM · Operations, Traffic
ema moved T265911: ATS trying to set socket options SO_MARK / IP_TOS from Triage to Bug Reports on the Traffic board.
Oct 26 2020, 12:26 PM · Operations, Traffic
ema added a comment to T101017: Early security release access for Lcawte (ShoutWiki).

There is an internal draft policy (I just gave you access) which I feel is mostly complete save clarification on a couple of the actual technical controls and processes. This needs some push from the Security-Team but I believe it is considered fairly low priority for us at this time.

Oct 26 2020, 12:24 PM · user-sbassett, Security-Team, ShoutWiki, WMF-Legal, WMF-NDA-Requests

Oct 23 2020

ema added a comment to T266155: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails.

@ema can you confirm that int-front cache status in the response means that the 429 was emitted by Varnish? From one of those: https://github.com/wikimedia/puppet/blob/338c1bd746aedf5c7ea7303cf31c64f30b9fee93/modules/varnish/templates/upload-frontend.inc.vcl.erb#L205

Oct 23 2020, 11:22 AM · StructuredDataOnCommons, Patch-For-Review, Operations, MediaWiki-File-management, Thumbor, Commons

Oct 22 2020

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Before we dig into that, the "total time" we're going to collect from ATS-TLS should be enough to know whether it can contain the extra latency or not. If it can't possibly contain it, then there's no point investigating delays or buffering between ATS-TLS and Varnish or other parts of the "total" timeframe.

Oct 22 2020, 9:45 AM · Performance-Team (Radar), Operations, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

cp3054 seems to be consistently a little faster for miss and pass, and overall a little slower for hit-front and hit-local. But I still can't see anything in the order of the extra tens of milliseconds seen on clients.

Oct 22 2020, 9:13 AM · Performance-Team (Radar), Operations, Traffic

Oct 21 2020

ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was returning hits on things that it shouldn't have, that are now correctly returned as misses. It would be really interesting to figure out if there's any pattern in what tends to be a hit on v5 vs v6. Understanding that delta might let us know if we're looking at a bug or a bugfix.

Oct 21 2020, 3:29 PM · Performance-Team (Radar), Operations, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I found that there's a significant difference between the number of n_objecthead on v5 and v6:

Oct 21 2020, 1:18 PM · Performance-Team (Radar), Operations, Traffic
ema added a comment to T266040: Large text objects are randomized to cache backends.
  1. User requests /foo/bar -> frontend cache cp1234 miss -> chash to cp9999
  2. Response from cp9999 indicates CL:500KB, so vcl_backend_response does a return(pass(beresp.ttl))
  3. Next request for /foo/bar to cp1234 -> finds hfp object -> goes through vcl_pass -> random backend cpXXXX

It would have to be during that step 3 request that we would "know" that ats-be can't cache the object either, which is tricky I think.

Oct 21 2020, 12:38 PM · Patch-For-Review, Operations, Traffic
ema added a comment to T266040: Large text objects are randomized to cache backends.

Instead of using hfp vs hfm, I think we might want to distinguish between requests that definitely cannot be cached at the ats-be layer (eg: those with req.http.Authorization) and those that potentially could result in a backend hit, like large_objects_cutoff. The former should honor pass_random in vcl_pass, the latter should always chash no matter what pass_random says?

Oct 21 2020, 10:39 AM · Patch-For-Review, Operations, Traffic
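
The proposal in the T266040 comment above is a small decision rule: honour pass_random only for requests that can never be a backend hit, and always hash the rest. The Go sketch below illustrates that rule and is not the production VCL. The type and function names are hypothetical, and a plain hash-modulo-N stands in for real consistent hashing.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

type passKind int

const (
	neverCacheable passKind = iota // e.g. the request carries Authorization
	maybeCacheable                 // e.g. the object exceeded large_objects_cutoff
)

// hashBackend maps a URL to a backend deterministically. Hash-modulo-N is a
// simple stand-in here, not a true consistent hash.
func hashBackend(url string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(url))
	return backends[int(h.Sum32()%uint32(len(backends)))]
}

// pickBackend honours passRandom only for requests that definitely cannot be
// cached at the backend layer; everything else is hashed on the URL.
func pickBackend(url string, kind passKind, passRandom bool, backends []string, randIntn func(int) int) string {
	if kind == neverCacheable && passRandom {
		return backends[randIntn(len(backends))]
	}
	return hashBackend(url, backends)
}

func main() {
	backends := []string{"backend-a", "backend-b", "backend-c"} // hypothetical names
	fmt.Println(pickBackend("/foo/bar", maybeCacheable, true, backends, rand.Intn))
	fmt.Println(pickBackend("/foo/bar", neverCacheable, true, backends, rand.Intn))
}
```

Under this rule a large object requested repeatedly always lands on the same backend, so that backend has a chance to serve it from cache.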
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Change 635298 merged by Ema:
[operations/puppet@production] varnish: fix websockets on 6.x

https://gerrit.wikimedia.org/r/635298

Oct 21 2020, 10:06 AM · Performance-Team (Radar), Operations, Traffic
ema added a comment to T265324: Create the base container images for running MediaWiki in a production environment.
  • one base image, which uses the apache2-bin debian package and just modifies the vanilla configuration to listen on port 8080 (so that the container can run as user www-data).

In case it makes things easier/cleaner, instead of modifying the configuration you could set the capability CAP_NET_BIND_SERVICE.

Oct 21 2020, 8:35 AM · Operations, serviceops, MW-on-K8s
ema created P13041 635302-upload-vtc-err.log.
Oct 21 2020, 8:31 AM
ema created P13040 cp-part.diff.
Oct 21 2020, 8:02 AM

Oct 20 2020

ema closed T203191: prometheus-varnish-exporter@frontend.service: Unit entered failed state - invalid character 'C' as Resolved.

The following now returns nothing:

Oct 20 2020, 2:31 PM · observability, Traffic, Operations
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).
SELECT event.responsestart - event.fetchstart FROM event.navigationtiming WHERE year = 2020 AND month = 10 AND day > 6 AND recvfrom = 'cp3052.esams.wmnet' AND event.isOversample = false AND event.responsestart - event.fetchstart >= 0;

I've noticed that on nodes with Varnish 6 the worst time_firstbyte values reported by ats-tls are very often around 26 seconds, and they're due to etherpad. Can you try this once again, but excluding Host: etherpad.wikimedia.org?

Oct 20 2020, 1:08 PM · Performance-Team (Radar), Operations, Traffic
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I've captured 30 minutes of data using varnishlog simultaneously on cp3052 and cp3054, using 4 variants of this command for hit-front, hit-local, miss and pass:

varnishlog -n frontend -q "RespHeader eq 'X-Cache-Status: hit-front'" -q "ReqURL ~ '^/wiki/*'" -i Timestamp | grep Resp | awk '{print $5}' > cp3052-hit-front.log

[...]

It's possible that the extra time comes from something Varnish doesn't measure. It's unclear to me whether the last Timestamp in a Varnish response includes the time it took to actually ship the bytes to the client (ats-tls in this case?) and have them acknowledged.

Oct 20 2020, 10:06 AM · Performance-Team (Radar), Operations, Traffic
ema triaged T265869: Consider collecting more timestamp milestones from ATS-TLS as Medium priority.
Oct 20 2020, 9:26 AM · Performance-Team (Radar), Operations, Traffic

Oct 19 2020

ema created T265911: ATS trying to set socket options SO_MARK / IP_TOS.
Oct 19 2020, 2:44 PM · Operations, Traffic
ema moved T265625: ats-be occasional system CPU usage increase from Triage to Caching on the Traffic board.
Oct 19 2020, 2:32 PM · Operations, Traffic
ema moved T265869: Consider collecting more timestamp milestones from ATS-TLS from Triage to Caching on the Traffic board.
Oct 19 2020, 2:32 PM · Performance-Team (Radar), Operations, Traffic

Oct 16 2020

ema updated ema.
Oct 16 2020, 9:34 AM
ema closed T264074: varnishkafka 1.1.0 CPU usage increase as Resolved.

All varnishkafka instances restarted with 6.0.6-1wm2, CPU usage looks like this now:

Oct 16 2020, 9:30 AM · Patch-For-Review, Analytics-Clusters, Traffic, Operations
ema added a comment to T264729: Blog post series: the evolution of Wikimedia's Content Delivery Network.

This has been changed, and I announced on Twitter.

Oct 16 2020, 8:58 AM · Operations, Traffic, Technical-blog-posts
ema closed T185968: varnish 5.1.3 frontend child restarted as Resolved.

We haven't seen this happening anymore after setting transient storage limits. Closing.

Oct 16 2020, 8:13 AM · Traffic, Operations

Oct 15 2020

ema created T265625: ats-be occasional system CPU usage increase.
Oct 15 2020, 3:14 PM · Operations, Traffic
ema added a comment to T264074: varnishkafka 1.1.0 CPU usage increase.

I've upgraded cp3050 to 6.0.6-1wm2 and restarted varnishkafka-webrequest.service at 14:12 to pick up the new library. Varnishkafka's CPU usage went down immediately as expected. I've then reloaded the service at 14:21: on systems affected by this bug that would have resulted in CPU usage going back up. On cp3050 CPU usage stayed the same.

Oct 15 2020, 2:37 PM · Patch-For-Review, Analytics-Clusters, Traffic, Operations
ema closed T162612: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 as Resolved.

@ema: All related patches in Gerrit have been merged. Can this task be resolved

Oct 15 2020, 1:29 PM · Operations
ema closed T162612: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9, a subtask of T162029: Migrate all jessie hosts to Linux 4.9, as Resolved.
Oct 15 2020, 1:29 PM · Operations
ema added a comment to T264729: Blog post series: the evolution of Wikimedia's Content Delivery Network.

"Own work" is not a person or a license, seen in captions. Should be replaced with eg "attribution/license", or omitted, I think?

Oct 15 2020, 8:29 AM · Operations, Traffic, Technical-blog-posts

Oct 14 2020

ema added a comment to T264074: varnishkafka 1.1.0 CPU usage increase.

I've opened https://github.com/varnishcache/varnish-cache/issues/3436 for 6.5/master, https://github.com/varnishcache/varnish-cache/issues/3437 for 6.0.6 (LTS), and proposed https://github.com/varnishcache/varnish-cache/pull/3438 as a fix for the latter.

Oct 14 2020, 1:25 PM · Patch-For-Review, Analytics-Clusters, Traffic, Operations
ema updated subscribers of T264729: Blog post series: the evolution of Wikimedia's Content Delivery Network.

@ema are you the sole author

Oct 14 2020, 9:41 AM · Operations, Traffic, Technical-blog-posts

Oct 13 2020

ema added a comment to T264074: varnishkafka 1.1.0 CPU usage increase.

So varnishkafka seems to be correctly looping continuously in the do-while part of VUT_Main. Why VSM_Status is being called so often remains an open question.

Oct 13 2020, 10:50 AM · Patch-For-Review, Analytics-Clusters, Traffic, Operations
ema added a comment to T264074: varnishkafka 1.1.0 CPU usage increase.

The function VUT_Main is the main loop of VUT programs. The while loop boils down to:

Oct 13 2020, 8:11 AM · Patch-For-Review, Analytics-Clusters, Traffic, Operations

Oct 12 2020

ema added a comment to T264074: varnishkafka 1.1.0 CPU usage increase.

I am wondering if we could quickly test how varnishncsa behaves when we pass -q, that seems to be the big difference between the two.

Oct 12 2020, 5:36 PM · Patch-For-Review, Analytics-Clusters, Traffic, Operations
ema added a comment to T264074: varnishkafka 1.1.0 CPU usage increase.

Most of the usage seems to be VUT related, especially for fxstatat64 (no idea where it is used).

Oct 12 2020, 3:27 PM · Patch-For-Review, Analytics-Clusters, Traffic, Operations
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I guess the difference is that I did a HEAD request. It's still reproducible right now.

Oct 12 2020, 12:40 PM · Performance-Team (Radar), Operations, Traffic
ema updated the task description for T257118: Beta cluster has reached its quota.
Oct 12 2020, 11:22 AM · Beta-Cluster-Infrastructure
ema added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I've checked with curl and it does get cached by both hosts:

[...]

Oddly, cp3054 has Content-Length defined in the headers I get back and not cp3052?

Oct 12 2020, 9:47 AM · Performance-Team (Radar), Operations, Traffic
ema added a comment to T264987: Add cache response type and response size as new dimensions to navtiming_responsestart_by_host_seconds prometheus metric.

@ema since these new dimensions are labels, for transfersize we're going to need to come up with buckets ourselves. What buckets would you be interested in tracking?

Oct 12 2020, 9:37 AM · Patch-For-Review, MW-1.36-notes (1.36.0-wmf.18; 2020-11-17), MediaWiki-extensions-NavigationTiming, Performance-Team