
ema (Emanuele Rocca)
Senior Site Reliability Engineer, Traffic Team

User Details

User Since
Sep 29 2015, 8:49 PM (198 w, 2 d)
Availability
Available
IRC Nick
ema
LDAP User
Ema
MediaWiki User
Unknown

Recent Activity

Today

ema created P8773 mtail-error.log.
Thu, Jul 18, 1:28 PM

Yesterday

ema committed rLPRI234723e0ff7a: secret: dummy key for restbase (authored by ema).
Wed, Jul 17, 2:04 PM
ema moved T228135: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers from Triage to TLS on the Traffic board.
Wed, Jul 17, 9:36 AM · Traffic, Operations

Tue, Jul 16

ema created P8751 (An Untitled Masterwork).
Tue, Jul 16, 3:54 PM
ema created P8750 most-depressing-mtr.
Tue, Jul 16, 2:12 PM

Mon, Jul 15

ema moved T227828: Wikipedia is unavailable on Symbian phone's browsers from Triage to TLS on the Traffic board.
Mon, Jul 15, 10:19 AM · Traffic, Operations
ema triaged T227828: Wikipedia is unavailable on Symbian phone's browsers as Normal priority.
Mon, Jul 15, 10:19 AM · Traffic, Operations

Fri, Jul 12

ema created P8741 traffic-text-ats.yaml.
Fri, Jul 12, 1:49 PM
ema moved T227860: TLS certificates for Analytics origin servers from Triage to Caching on the Traffic board.
Fri, Jul 12, 9:53 AM · Patch-For-Review, Analytics-Kanban, User-Elukey, Operations, Analytics, Traffic
ema triaged T227860: TLS certificates for Analytics origin servers as Normal priority.
Fri, Jul 12, 9:48 AM · Patch-For-Review, Analytics-Kanban, User-Elukey, Operations, Analytics, Traffic
ema created T227860: TLS certificates for Analytics origin servers.
Fri, Jul 12, 9:48 AM · Patch-For-Review, Analytics-Kanban, User-Elukey, Operations, Analytics, Traffic
ema created P8740 beline.
Fri, Jul 12, 9:18 AM

Thu, Jul 11

ema triaged T225604: log spam from mtail 3.0.0~rc19 on wezen as Normal priority.
Thu, Jul 11, 8:42 AM · Patch-For-Review, observability

Wed, Jul 10

ema updated the task description for T227672: Upgrade Varnish to 5.1.3-1wm11.
Wed, Jul 10, 3:09 PM · Operations, Traffic
ema updated the task description for T227672: Upgrade Varnish to 5.1.3-1wm11.
Wed, Jul 10, 2:51 PM · Operations, Traffic
ema moved T227672: Upgrade Varnish to 5.1.3-1wm11 from Triage to Caching on the Traffic board.
Wed, Jul 10, 2:50 PM · Operations, Traffic
ema triaged T227672: Upgrade Varnish to 5.1.3-1wm11 as Normal priority.
Wed, Jul 10, 2:50 PM · Operations, Traffic
ema created T227672: Upgrade Varnish to 5.1.3-1wm11.
Wed, Jul 10, 2:50 PM · Operations, Traffic
ema added a comment to T216140: Investigating using CI to automate testing VCL changes against all cluster/dc combos.

Note that VTC tests do not have to be run on cache servers anymore. They can be executed from SRE workstations using vagrant, which should help here. See https://wikitech.wikimedia.org/wiki/Varnish#Configuration

Wed, Jul 10, 2:18 PM · Operations, Traffic
ema moved T227668: Per-backend ATS Prometheus metrics from Triage to Caching on the Traffic board.
Wed, Jul 10, 2:05 PM · User-fgiunchedi, observability, Operations, Traffic
ema triaged T227668: Per-backend ATS Prometheus metrics as Normal priority.
Wed, Jul 10, 2:05 PM · User-fgiunchedi, observability, Operations, Traffic
ema updated the task description for T187716: Sunset Wikipedia Zero.
Wed, Jul 10, 1:56 PM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.14; 2019-07-16), Release-Engineering-Team-TODO (201907), Epic, Reading-Infrastructure-Team-Backlog, Wikimedia-Site-requests
ema closed T213769: Zero VCL removal, a subtask of T187716: Sunset Wikipedia Zero, as Resolved.
Wed, Jul 10, 1:55 PM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.14; 2019-07-16), Release-Engineering-Team-TODO (201907), Epic, Reading-Infrastructure-Team-Backlog, Wikimedia-Site-requests
ema closed T213769: Zero VCL removal as Resolved.

This is now done!

Wed, Jul 10, 1:55 PM · Patch-For-Review, Zero, Operations, Traffic

Tue, Jul 9

ema changed the status of T213769: Zero VCL removal, a subtask of T187716: Sunset Wikipedia Zero, from Stalled to Open.
Tue, Jul 9, 1:27 PM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.14; 2019-07-16), Release-Engineering-Team-TODO (201907), Epic, Reading-Infrastructure-Team-Backlog, Wikimedia-Site-requests
ema changed the status of T213769: Zero VCL removal from Stalled to Open.
Tue, Jul 9, 1:27 PM · Patch-For-Review, Zero, Operations, Traffic
ema closed T224119: ATS is currently adding its own server header as Resolved.

ATS now sets Server only if missing in the origin server response. Also, Varnish now does not send Via any longer (the header wasn't used at all).

Tue, Jul 9, 8:50 AM · Operations, Traffic
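
The Server header behaviour described in the T224119 comment above can be sketched as follows. This is only a minimal Python illustration of the "set only if missing" rule, not ATS configuration or the actual change; the function name ensure_server_header and the default value "ATS" are hypothetical.

```python
def ensure_server_header(response_headers, default="ATS"):
    """Return a copy of the origin response headers with a Server header
    added only when the origin did not send one.

    Header names are compared case-insensitively. The default value is a
    placeholder chosen for this example.
    """
    result = dict(response_headers)
    has_server = any(name.lower() == "server" for name in result)
    if not has_server:
        result["Server"] = default
    return result


# The origin's own Server value is preserved when present.
with_server = ensure_server_header({"server": "origin-app", "Content-Type": "text/html"})
# A Server header is added when the origin sent none.
without_server = ensure_server_header({"Content-Type": "text/html"})
```
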

Mon, Jul 8

ema closed T226589: Replace Varnish backends with ATS on cache upload nodes as Resolved.

All cache_upload nodes are now using ATS instead of Varnish for on-disk caching. Closing.

Mon, Jul 8, 9:42 AM · Operations, Traffic
ema closed T227328: Rename role::cache::upload_ats to role::cache::upload, a subtask of T226589: Replace Varnish backends with ATS on cache upload nodes, as Resolved.
Mon, Jul 8, 9:41 AM · Operations, Traffic
ema closed T227328: Rename role::cache::upload_ats to role::cache::upload as Resolved.

Done.

Mon, Jul 8, 9:41 AM · Patch-For-Review, Operations, Traffic
ema updated the task description for T227432: Replace Varnish backends with ATS on cache text nodes.
Mon, Jul 8, 8:17 AM · Patch-For-Review, Traffic, Operations
ema moved T227432: Replace Varnish backends with ATS on cache text nodes from Triage to Caching on the Traffic board.
Mon, Jul 8, 8:15 AM · Patch-For-Review, Traffic, Operations
ema triaged T227432: Replace Varnish backends with ATS on cache text nodes as Normal priority.
Mon, Jul 8, 8:14 AM · Patch-For-Review, Traffic, Operations
ema created T227432: Replace Varnish backends with ATS on cache text nodes.
Mon, Jul 8, 8:14 AM · Patch-For-Review, Traffic, Operations

Fri, Jul 5

ema moved T227328: Rename role::cache::upload_ats to role::cache::upload from Triage to Caching on the Traffic board.
Fri, Jul 5, 1:44 PM · Patch-For-Review, Operations, Traffic
ema triaged T227328: Rename role::cache::upload_ats to role::cache::upload as Normal priority.
Fri, Jul 5, 1:43 PM · Patch-For-Review, Operations, Traffic
ema created T227328: Rename role::cache::upload_ats to role::cache::upload.
Fri, Jul 5, 1:43 PM · Patch-For-Review, Operations, Traffic
ema closed T222620: cp1083 crashed as Resolved.

The host has been in production for weeks without issues now. Closing.

Fri, Jul 5, 1:22 PM · Operations, ops-eqiad, Traffic
ema closed T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad, a subtask of T226589: Replace Varnish backends with ATS on cache upload nodes, as Resolved.
Fri, Jul 5, 1:06 PM · Operations, Traffic
ema closed T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad as Resolved.

With the conversion of cp1090 this is now done.

Fri, Jul 5, 1:06 PM · Operations, Traffic
ema renamed T225786: Increased number of webrequest sequence-numbers alarms (mostly) on upload webrequest-source from Investigate varnish behavior change since new ATS-change in webrequest upload to Increased number of webrequest sequence-numbers alarms (mostly) on upload webrequest-source.
Fri, Jul 5, 12:05 PM · Traffic, Analytics, Operations
ema added a comment to T225998: Study performance impact of disabling TCP selective acknowledgments.

@Gilles: is there anything left to be done here? Other than blogging about the results, that is. :-)

Fri, Jul 5, 12:02 PM · Patch-For-Review, Traffic, Performance-Team, Performance, Operations
ema moved T226444: rack/setup/install ganeti400[123] from Triage to General on the Traffic board.
Fri, Jul 5, 12:00 PM · Traffic, Operations
ema moved T189333: Changing Kibana filters is ridiculously slow from Triage to Watching on the Traffic board.
Fri, Jul 5, 11:59 AM · Traffic, Operations, Patch-For-Review, User-Addshore, Wikimedia-Logstash
ema moved T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) from Triage to Watching on the Traffic board.
Fri, Jul 5, 11:59 AM · TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations

Thu, Jul 4

ema closed T198152: Size of headers processed by varnish?, a subtask of T197281: Fix failing webrequest hours (upload and text 2018-06-14-11), as Resolved.
Thu, Jul 4, 2:38 PM · Analytics-Kanban
ema closed T198152: Size of headers processed by varnish? as Resolved.

The maximum allowed request header size (field name + value) is now 8192 bytes. Closing.

Thu, Jul 4, 2:38 PM · Operations, Traffic, Analytics
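
As a rough illustration of the 8192-byte limit mentioned in the T198152 comment above, the following Python sketch flags request header fields that would exceed it. It assumes the size is counted as the encoded field name plus the encoded value, without the ": " separator or line terminator; how the cache software actually counts is not stated in the comment. The names MAX_HEADER_BYTES, header_field_size and oversized_headers are hypothetical.

```python
MAX_HEADER_BYTES = 8192  # limit quoted in the comment (field name + value)


def header_field_size(name, value):
    """Size of one header field in bytes, counted as name + value.

    Assumption: the separator and CRLF are not included in the count.
    """
    return len(name.encode("utf-8")) + len(value.encode("utf-8"))


def oversized_headers(headers, limit=MAX_HEADER_BYTES):
    """Return the names of header fields whose size exceeds the limit."""
    return [
        name
        for name, value in headers.items()
        if header_field_size(name, value) > limit
    ]


# Example with made-up header values.
sample = {"Host": "example.org", "Cookie": "a" * 9000}
too_big = oversized_headers(sample)
```
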
ema updated the task description for T212772: Track remaining trusty servers in production.
Thu, Jul 4, 11:57 AM · cloud-services-team (Kanban), Operations
ema changed the status of T213769: Zero VCL removal, a subtask of T187716: Sunset Wikipedia Zero, from Open to Stalled.
Thu, Jul 4, 11:56 AM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.14; 2019-07-16), Release-Engineering-Team-TODO (201907), Epic, Reading-Infrastructure-Team-Backlog, Wikimedia-Site-requests
ema changed the status of T213769: Zero VCL removal from Open to Stalled.

Yeah, it's mostly just blocked on us making some time to deal with it, and time has been in extremely short supply lately, so we tend not to prioritize anything that doesn't have imminent impact. There are some subtleties to backing out that stuff in stages and not breaking things.

Thu, Jul 4, 11:56 AM · Patch-For-Review, Zero, Operations, Traffic

Wed, Jul 3

ema renamed T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) from Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) to Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).
Wed, Jul 3, 1:06 PM · TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations
ema closed T226637: Replace Varnish backends with ATS on cache upload nodes in codfw as Resolved.

Done!

Wed, Jul 3, 12:54 PM · Operations, Traffic
ema closed T226637: Replace Varnish backends with ATS on cache upload nodes in codfw, a subtask of T226589: Replace Varnish backends with ATS on cache upload nodes, as Resolved.
Wed, Jul 3, 12:54 PM · Operations, Traffic
Restricted Application added a project to T189333: Changing Kibana filters is ridiculously slow: Operations.
Wed, Jul 3, 12:51 PM · Traffic, Operations, Patch-For-Review, User-Addshore, Wikimedia-Logstash
ema added a comment to T189333: Changing Kibana filters is ridiculously slow.

I've prepared https://gerrit.wikimedia.org/r/520425 to restrict the request/response headers sent to logstash by the various varnishlogconsumer daemons.

Wed, Jul 3, 12:50 PM · Traffic, Operations, Patch-For-Review, User-Addshore, Wikimedia-Logstash

Tue, Jul 2

ema created P8700 (An Untitled Masterwork).
Tue, Jul 2, 4:09 PM
ema added a comment to T198152: Size of headers processed by varnish?.

Both varnish and nginx limit the maximum request header length to 8k by default. We have set nginx's limit to 16k, while leaving the default on varnish untouched.

Tue, Jul 2, 12:56 PM · Operations, Traffic, Analytics
ema added a comment to T222041: cp3037 is currently unreachable.

Can someone start the decommission process? This host shows up in things like debdeploy runs or cumin runs, and that's distracting.

Tue, Jul 2, 11:44 AM · ops-esams, Operations, Traffic

Mon, Jul 1

ema created P8689 (An Untitled Masterwork).
Mon, Jul 1, 1:51 PM
ema added a comment to T222356: facter3: Unable to parse routing table.

The updated package in boron:~jbond/src/facter-3.11.0 looks good. IMO we can skip backporting this to jessie. The only cp host on jessie is the obsolete cp1008, which will be migrated to stretch/new server soon.

Mon, Jul 1, 11:56 AM · Packaging, Puppet, Operations

Fri, Jun 28

ema moved T226805: nginx HTTP 500 rate increase on specific cache hosts from Triage to TLS on the Traffic board.
Fri, Jun 28, 8:30 AM · Operations, Traffic
ema triaged T226805: nginx HTTP 500 rate increase on specific cache hosts as Normal priority.
Fri, Jun 28, 8:30 AM · Operations, Traffic
ema added a comment to T226776: mobile commons GET dying in Varnish layer(?) under oddly specific conditions.

Well, the error response was surely generated by the applayer and not by varnish (the latter only generates synthetic responses with 503, not 500). However, varnish is definitely implicated in the issue, given that essentially no response header was returned.

Fri, Jun 28, 8:29 AM · Operations, Traffic
Restricted Application added a project to T226805: nginx HTTP 500 rate increase on specific cache hosts: Operations.
Fri, Jun 28, 8:29 AM · Operations, Traffic
ema added a comment to T226776: mobile commons GET dying in Varnish layer(?) under oddly specific conditions.

the link started working again. Two things had changed at that time, the thing above and I had logged out and back in to drop my cookie. Trying with the old cookie gave a proper HTML answer as well, so couldn't have been a VCL parse error on the cookie value.

Fri, Jun 28, 6:33 AM · Operations, Traffic
ema added a comment to T226048: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster).

@Trizek-WMF: personally, I've tried and failed for days to reproduce the issue. My understanding is that occasionally some page loads for logged-in users take a very long time to complete, with characters slowly showing up on the screen.

Fri, Jun 28, 6:24 AM · CommRel-Specialists-Support (Jul-Sep-2019), User-notice, Performance-Team (Radar), Operations, Traffic, Performance
ema added a comment to T226776: mobile commons GET dying in Varnish layer(?) under oddly specific conditions.

I cannot reproduce the issue right now, but it does look like a strange interaction between the application servers and varnish.

Fri, Jun 28, 6:18 AM · Operations, Traffic
ema moved T226776: mobile commons GET dying in Varnish layer(?) under oddly specific conditions from Triage to Caching on the Traffic board.
Fri, Jun 28, 6:07 AM · Operations, Traffic
ema triaged T226776: mobile commons GET dying in Varnish layer(?) under oddly specific conditions as Normal priority.
Fri, Jun 28, 6:06 AM · Operations, Traffic

Thu, Jun 27

ema edited P8663 (An Untitled Masterwork).
Thu, Jun 27, 2:45 PM
ema created P8663 (An Untitled Masterwork).
Thu, Jun 27, 2:45 PM
ema closed T226685: HTTP 503 on zh.wikipedia.org as Resolved.

This 503 error was due to network issues in eqiad as mentioned by @Marostegui and @Antigng.

Thu, Jun 27, 10:57 AM · Operations
ema added a comment to T226318: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408.

Just bumped into another very frequent one:

Thu, Jun 27, 8:20 AM · Traffic, Operations, Thumbor, MediaWiki-File-management, Commons, Multimedia

Wed, Jun 26

ema closed T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin as Resolved.

Done.

Wed, Jun 26, 4:06 PM · Operations, Traffic
ema closed T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin, a subtask of T226589: Replace Varnish backends with ATS on cache upload nodes, as Resolved.
Wed, Jun 26, 4:06 PM · Operations, Traffic
ema moved T226271: Image thumbnail (cache?) broken on English Wikipedia, e.g. Information.svg, when viewing non-default resolution (e.g. 241px) from Triage to Caching on the Traffic board.
Wed, Jun 26, 1:58 PM · Operations, Traffic, Thumbor, MediaWiki-File-management, Multimedia, Commons
ema moved T226637: Replace Varnish backends with ATS on cache upload nodes in codfw from Triage to Caching on the Traffic board.
Wed, Jun 26, 1:58 PM · Operations, Traffic
ema moved T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad from Triage to Caching on the Traffic board.
Wed, Jun 26, 1:58 PM · Operations, Traffic
ema triaged T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad as Normal priority.
Wed, Jun 26, 1:58 PM · Operations, Traffic
ema created T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad.
Wed, Jun 26, 1:58 PM · Operations, Traffic
ema triaged T226637: Replace Varnish backends with ATS on cache upload nodes in codfw as Normal priority.
Wed, Jun 26, 1:57 PM · Operations, Traffic
ema created T226637: Replace Varnish backends with ATS on cache upload nodes in codfw.
Wed, Jun 26, 1:57 PM · Operations, Traffic
ema added a comment to T226077: CVE-2019-11477 / CVE-2019-11478: Linux Kernel: Multiple TCP-based remote denial of service vulnerabilities.

No, Wikimedia servers do not use Ubuntu.

I see :meta:Wikimedia servers says "All our servers run either Debian or Ubuntu Server" :)

Wed, Jun 26, 12:02 PM · Security
ema added a comment to T225998: Study performance impact of disabling TCP selective acknowledgments.

Remember that x-cache headers are read from right to left.

Wed, Jun 26, 6:03 AM · Patch-For-Review, Traffic, Performance-Team, Performance, Operations
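
The right-to-left reading of x-cache headers mentioned in the comment above can be sketched in Python. This is an illustration under assumptions: the header is taken to be a comma-separated list of "host status" entries, and the sample value is made up for this example. The function name parse_x_cache is hypothetical.

```python
def parse_x_cache(value):
    """Split an X-Cache header value into (host, status) pairs,
    ordered right to left, i.e. starting from the last entry.

    Assumption: entries are comma-separated and each entry is
    "<host> <status>".
    """
    entries = [entry.strip() for entry in value.split(",") if entry.strip()]
    hops = []
    for entry in reversed(entries):
        host, _, status = entry.partition(" ")
        hops.append((host, status.strip()))
    return hops


# Made-up header value, for illustration only.
hops = parse_x_cache("cp-a miss, cp-b hit/4, cp-c hit/1")
```

With this ordering, the first element of the returned list is the rightmost entry of the header, which is where reading should start according to the comment.
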
ema moved T226589: Replace Varnish backends with ATS on cache upload nodes from Triage to Caching on the Traffic board.
Wed, Jun 26, 5:34 AM · Operations, Traffic
ema triaged T226589: Replace Varnish backends with ATS on cache upload nodes as Normal priority.
Wed, Jun 26, 5:34 AM · Operations, Traffic

Tue, Jun 25

ema added a comment to T226318: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408.

Note that thumbor is occasionally returning 500 for that object. Hitting ATS to skip varnish-fe transformations:

Tue, Jun 25, 12:30 PM · Traffic, Operations, Thumbor, MediaWiki-File-management, Commons, Multimedia
ema moved T226318: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 from Triage to Caching on the Traffic board.
Tue, Jun 25, 11:48 AM · Traffic, Operations, Thumbor, MediaWiki-File-management, Commons, Multimedia
ema added a project to T226318: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408: Traffic.
Tue, Jun 25, 11:48 AM · Traffic, Operations, Thumbor, MediaWiki-File-management, Commons, Multimedia
ema added a comment to T226318: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408.

@Gilles we've just had an interesting report on #wikimedia-operations that seems related.

Tue, Jun 25, 11:37 AM · Traffic, Operations, Thumbor, MediaWiki-File-management, Commons, Multimedia
ema created P8653 thumbor-429.log.
Tue, Jun 25, 11:32 AM
ema moved T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin from Triage to Caching on the Traffic board.
Tue, Jun 25, 11:11 AM · Operations, Traffic
ema triaged T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin as Normal priority.
Tue, Jun 25, 8:46 AM · Operations, Traffic
Restricted Application added a project to T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin: Operations.
Tue, Jun 25, 8:46 AM · Operations, Traffic
ema renamed T226048: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) from Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) to Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster).
Tue, Jun 25, 8:07 AM · CommRel-Specialists-Support (Jul-Sep-2019), User-notice, Performance-Team (Radar), Operations, Traffic, Performance

Mon, Jun 24

ema added a comment to T226048: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster).

We believe that Varnish fetch failures might be related to this issue; the investigation is ongoing in T226375.

Mon, Jun 24, 4:32 PM · CommRel-Specialists-Support (Jul-Sep-2019), User-notice, Performance-Team (Radar), Operations, Traffic, Performance
ema updated the task description for T226375: Investigate esams text varnish backend fetch failures.
Mon, Jun 24, 2:01 PM · Patch-For-Review, Operations, Traffic
ema updated the task description for T226375: Investigate esams text varnish backend fetch failures.
Mon, Jun 24, 1:36 PM · Patch-For-Review, Operations, Traffic
ema updated the task description for T226375: Investigate esams text varnish backend fetch failures.
Mon, Jun 24, 1:27 PM · Patch-For-Review, Operations, Traffic
ema updated the task description for T226375: Investigate esams text varnish backend fetch failures.
Mon, Jun 24, 1:20 PM · Patch-For-Review, Operations, Traffic