
ema (Emanuele Rocca)
Senior Site Reliability Engineer, Traffic Team

User Details

User Since
Sep 29 2015, 8:49 PM (216 w, 20 h)
Availability
Available
IRC Nick
ema
LDAP User
Ema
MediaWiki User
Unknown

Recent Activity

Today

ema added a comment to T237360: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs.

This just happened on cp2023 too.

Wed, Nov 20, 4:49 PM · Traffic, Operations
ema added a comment to T237319: 502 errors on ATS/8.0.5.

On cp1075:

Wed, Nov 20, 4:07 PM · Operations, Traffic, User-DannyS712

Yesterday

ema added a comment to T238494: 200ms / 50% response start regression starting around 2019-11-11.

@Gilles To see whether, and to what extent, ats-tls is also responsible for some of the performance degradation, you can query hadoop and check the ssl timings for cp3064. Two interesting events are Nov 12, 2:54 PM (new TLS certs deployed) and Fri, Nov 15, 5:07 AM (cp3064 switched from nginx to ats-tls): https://phabricator.wikimedia.org/T231627#5666181

Tue, Nov 19, 4:51 PM · Patch-For-Review, Traffic, Operations, Performance-Team
ema added a comment to T238086: Edge cache response time per server should be monitored.

As per the IRC conversation with @Gilles, we do have frontend servers tagged in navtiming hadoop data. It would be very useful to have that information in graphite and to add the cache frontends as a dropdown to https://grafana.wikimedia.org/d/000000143/navigation-timing

Tue, Nov 19, 4:43 PM · Performance-Team (Radar), Traffic, Operations
ema added a comment to T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled).

You can disregard that entirely: I was on phab1001 and not phab1003 by accident, so not the prod server.

Tue, Nov 19, 4:20 PM · Operations, Traffic, serviceops, Phabricator
ema added a comment to T238540: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter.

I've observed the request failing as described in this task by using the Chromium developer tools, copied it as curl and tried it against cp1075

Tue, Nov 19, 3:28 PM · User-Addshore, Operations, observability, Traffic, Wikidata
ema moved T238540: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter from Caching to TLS on the Traffic board.
Tue, Nov 19, 3:25 PM · User-Addshore, Operations, observability, Traffic, Wikidata
ema added a comment to T238540: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter.

Interesting, I've observed the request failing as described in this task by using the Chromium developer tools, copied it as curl and tried it against cp1075. The dashboard did get deleted. Private info replaced with 'blah':

Tue, Nov 19, 3:07 PM · User-Addshore, Operations, observability, Traffic, Wikidata
ema moved T238540: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter from Triage to Caching on the Traffic board.
Tue, Nov 19, 2:56 PM · User-Addshore, Operations, observability, Traffic, Wikidata
ema triaged T238540: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter as Normal priority.
Tue, Nov 19, 2:56 PM · User-Addshore, Operations, observability, Traffic, Wikidata
ema added a comment to T238597: envoyproxy does not automatically reload certificates.

@Joe: is there any potential risk in making profile::tlsproxy::envoy::use_hot_restarter default to true as @CDanis suggested?

Tue, Nov 19, 10:52 AM · serviceops, Operations
ema merged task T236125: Trigger envoy reload upon TLS certificate update into T238597: envoyproxy does not automatically reload certificates.
Tue, Nov 19, 10:48 AM · Operations, Traffic
ema merged T236125: Trigger envoy reload upon TLS certificate update into T238597: envoyproxy does not automatically reload certificates.
Tue, Nov 19, 10:48 AM · serviceops, Operations
ema closed T233768: Enable mwdebug routes for noc.wikimedia.org as Resolved.

This is now done:

$ curl -v https://noc.wikimedia.org/Potato -H "X-Wikimedia-Debug: mwdebug1001.eqiad.wmnet" 2>&1 | egrep "(x-cache|server):"
< server: mwdebug1001.eqiad.wmnet
< x-cache: cp3052 pass, cp3054 pass

Closing!

Tue, Nov 19, 10:39 AM · Performance-Team (Radar), Traffic, Operations
ema added a comment to T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled).

By going through SAL and the IRC logs on #wikimedia-operations I've reconstructed the events as follows. There are some parts I don't understand, so please fill in the gaps.

Tue, Nov 19, 10:00 AM · Operations, Traffic, serviceops, Phabricator
ema moved T237319: 502 errors on ATS/8.0.5 from Triage to Caching on the Traffic board.
Tue, Nov 19, 8:51 AM · Operations, Traffic, User-DannyS712
ema moved T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled) from Triage to Caching on the Traffic board.
Tue, Nov 19, 8:51 AM · Operations, Traffic, serviceops, Phabricator
ema added a project to T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled): Traffic.
Tue, Nov 19, 8:51 AM · Operations, Traffic, serviceops, Phabricator
ema triaged T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled) as Normal priority.
Tue, Nov 19, 8:51 AM · Operations, Traffic, serviceops, Phabricator

Mon, Nov 18

ema added a comment to T238494: 200ms / 50% response start regression starting around 2019-11-11.

There's been a decrease in local backend hitrate on ats-be compared to varnish-be. While on 2019-11-11 (before reimages to ats) the local hitrate was about 3.5%, today it is 1.3%:
https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?var-cluster=text&var-site=esams&from=now-15d

Mon, Nov 18, 2:38 PM · Patch-For-Review, Traffic, Operations, Performance-Team
ema moved T238494: 200ms / 50% response start regression starting around 2019-11-11 from Triage to Caching on the Traffic board.
Mon, Nov 18, 2:18 PM · Patch-For-Review, Traffic, Operations, Performance-Team

Fri, Nov 15

ema updated the task description for T227432: Replace Varnish backends with ATS on cache text nodes.
Fri, Nov 15, 3:17 PM · Patch-For-Review, Traffic, Operations
ema added a comment to T106517: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404.

I cannot reproduce with URLs such as https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Kitagawa_Utamaro_-_Toji_san_bijin_%28Three_Beauties_of_the_Present_Day%29From_Bijin-ga_%28Pictures_of_Beautiful_Women%29,_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg/200px-Kitagawa_Utamaro_-_Toji_san_bijin_%28Three_Beauties_of_the_Present_Day%29From_Bijin-ga_%28Pictures_of_Beautiful_Women%29,_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg

Fri, Nov 15, 12:22 PM · Wikimedia-Incident, Traffic, Operations
ema updated subscribers of T120509: Cache education dashboard pages.

@awight: is there anything to do here or can we close the task?

Fri, Nov 15, 12:15 PM · Operations, Traffic, Programs-and-Events-Dashboard-Sprint 2, Varnish, Education-Program-Dashboard
ema moved T233768: Enable mwdebug routes for noc.wikimedia.org from Triage to Caching on the Traffic board.
Fri, Nov 15, 11:35 AM · Performance-Team (Radar), Traffic, Operations
ema triaged T238285: Pages whose title ends with semicolon (;) are intermittently inaccessible as Normal priority.
Fri, Nov 15, 11:02 AM · Wikimedia-General-or-Unknown, Operations, Traffic, User-DannyS712
ema moved T238285: Pages whose title ends with semicolon (;) are intermittently inaccessible from Triage to Caching on the Traffic board.
Fri, Nov 15, 11:02 AM · Wikimedia-General-or-Unknown, Operations, Traffic, User-DannyS712
ema updated subscribers of T238285: Pages whose title ends with semicolon (;) are intermittently inaccessible.

I think this has something to do with the differences between how ATS and varnish normalize request URLs (if they do so at all). @ema @BBlack Can you double check?

Fri, Nov 15, 10:24 AM · Wikimedia-General-or-Unknown, Operations, Traffic, User-DannyS712

Thu, Nov 14

ema committed rLPRIf263b5e3174e: Add dummy digicert-2019a keys (authored by ema).
Add dummy digicert-2019a keys
Thu, Nov 14, 2:52 PM
ema committed rLPRI9b405ce79256: Add dummy globalsign-2019a-ecdsa-unified for pcc (authored by ema).
Add dummy globalsign-2019a-ecdsa-unified for pcc
Thu, Nov 14, 1:57 PM
ema triaged T238307: ats-tls shows spikes on H/2 recv settings bad param errors as Normal priority.
Thu, Nov 14, 11:40 AM · Patch-For-Review, Operations, Traffic
ema moved T238198: In valid byte sequence: File[/etc/update-ocsp.d/hooks/trafficserver-tls-ocsp] from Triage to Watching on the Traffic board.
Thu, Nov 14, 10:20 AM · Traffic, Puppet, Operations, User-jbond
ema moved T238307: ats-tls shows spikes on H/2 recv settings bad param errors from Triage to TLS on the Traffic board.
Thu, Nov 14, 10:20 AM · Patch-For-Review, Operations, Traffic
ema closed T238200: debmonitor TLS termination as Resolved.

TLS termination configured on port 7443:

$ curl -v https://debmonitor.wikimedia.org:7443/login/ --resolve debmonitor.wikimedia.org:7443:10.64.32.62 2>&1 | grep '< HTTP'
< HTTP/2 200

Thu, Nov 14, 10:19 AM · Operations, Traffic

Wed, Nov 13

ema triaged T238200: debmonitor TLS termination as Normal priority.
Wed, Nov 13, 11:08 AM · Operations, Traffic
ema created T238200: debmonitor TLS termination.
Wed, Nov 13, 11:08 AM · Operations, Traffic

Tue, Nov 12

ema triaged T236120: Get rid of nginx puppetization for cache upload as Normal priority.
Tue, Nov 12, 4:46 PM · Traffic, Operations
ema triaged T237993: Create replacement for Varnishkafka as Normal priority.
Tue, Nov 12, 4:12 PM · Traffic, Analytics, Operations
ema triaged T238034: Enable QUIC support on Wikimedia servers as Normal priority.
Tue, Nov 12, 4:12 PM · Operations, Traffic, HTTPS
ema triaged T238086: Edge cache response time per server should be monitored as Normal priority.
Tue, Nov 12, 4:12 PM · Performance-Team (Radar), Traffic, Operations
ema triaged T238085: Depooling single text caching server in esams had a disproportionate performance impact as Normal priority.
Tue, Nov 12, 4:11 PM · Performance-Team (Radar), Operations, Traffic
ema updated the task description for T238085: Depooling single text caching server in esams had a disproportionate performance impact.
Tue, Nov 12, 3:34 PM · Performance-Team (Radar), Operations, Traffic
ema added a comment to T238032: cp3065 crashed.

Perhaps interestingly, or maybe entirely unrelated: a couple of hours before crashing the host had a spike in cache write errors:

Tue, Nov 12, 3:28 PM · ops-esams, Operations, Traffic
ema moved T238085: Depooling single text caching server in esams had a disproportionate performance impact from Triage to Caching on the Traffic board.
Tue, Nov 12, 3:19 PM · Performance-Team (Radar), Operations, Traffic
ema moved T238086: Edge cache response time per server should be monitored from Triage to Caching on the Traffic board.
Tue, Nov 12, 3:19 PM · Performance-Team (Radar), Traffic, Operations
ema moved T238089: varnishlog consumers http request/response logging field explosion from Triage to Caching on the Traffic board.
Tue, Nov 12, 3:19 PM · Operations, Wikimedia-Logstash, Traffic
ema triaged T238089: varnishlog consumers http request/response logging field explosion as Normal priority.
Tue, Nov 12, 3:18 PM · Operations, Wikimedia-Logstash, Traffic
ema added a comment to T238089: varnishlog consumers http request/response logging field explosion.

I thought we had already addressed the issue with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520425/. Evidently there's something wrong with that patch. To be continued!

Tue, Nov 12, 3:15 PM · Operations, Wikimedia-Logstash, Traffic
ema added a comment to T237425: ats-tls-restart failed on cp4027.

Nov 05 15:22:48 cp4027 systemd[1]: Starting trafficserver-tls.service...
Nov 05 15:22:50 cp4027 update-ocsp-all[24836]: touch: cannot touch '/srv/trafficserver/tls/etc/ssl_multicert.config': Read-only file system
Nov 05 15:22:50 cp4027 systemd[1]: trafficserver-tls.service: Unit cannot be reloaded because it is inactive.
Nov 05 15:22:50 cp4027 update-ocsp-all[24836]: run-parts: /etc/update-ocsp.d/hooks/trafficserver-tls-ocsp exited with return code 99

I fixed the touch issue with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550475/, but update-ocsp-all still tries to reload trafficserver-tls.service, which fails because the unit is inactive. @Vgutierrez: any ideas for how to tackle this?
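
One possible way to avoid the failed reload, sketched here only as an illustration and not as the fix that was adopted, is to ask systemd whether the unit is active before reloading it. The helper name reload_if_active and the injectable run parameter are hypothetical; the only external commands used are systemctl is-active --quiet and systemctl reload.

```python
import subprocess


def reload_if_active(unit, run=subprocess.run):
    """Reload `unit` only when systemd reports it as active.

    Hypothetical helper: returns True if a reload was issued and
    False if the reload was skipped because the unit is not active.
    """
    # `systemctl is-active --quiet` exits 0 only for an active unit.
    probe = run(["systemctl", "is-active", "--quiet", unit])
    if probe.returncode != 0:
        return False
    # check=True raises CalledProcessError if the reload itself fails.
    run(["systemctl", "reload", unit], check=True)
    return True
```

With a guard like this, a hook run during a restart would skip the reload instead of exiting with an error while the unit is still inactive. Whether skipping is the right behaviour for the OCSP hook is an open question in the comment above.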

Tue, Nov 12, 2:56 PM · Operations, Traffic
ema moved T238034: Enable QUIC support on Wikimedia servers from Triage to TLS on the Traffic board.
Tue, Nov 12, 2:46 PM · Operations, Traffic, HTTPS
ema moved T237932: Remove debug proxies once all Varnish backends are gone from Triage to Caching on the Traffic board.
Tue, Nov 12, 2:45 PM · Operations, Traffic
ema moved T237993: Create replacement for Varnishkafka from Triage to Caching on the Traffic board.
Tue, Nov 12, 2:45 PM · Traffic, Analytics, Operations
ema moved T238032: cp3065 crashed from Triage to Hardware on the Traffic board.
Tue, Nov 12, 2:45 PM · ops-esams, Operations, Traffic
ema triaged T238032: cp3065 crashed as Normal priority.
Tue, Nov 12, 10:05 AM · ops-esams, Operations, Traffic

Mon, Nov 11

ema added a subtask for T227432: Replace Varnish backends with ATS on cache text nodes: T237687: ATS doesn't support X-Wikimedia-Debug.
Mon, Nov 11, 10:46 AM · Patch-For-Review, Traffic, Operations
ema added a parent task for T237687: ATS doesn't support X-Wikimedia-Debug: T227432: Replace Varnish backends with ATS on cache text nodes.
Mon, Nov 11, 10:46 AM · Performance-Team (Radar), Operations, Traffic
ema triaged T237932: Remove debug proxies once all Varnish backends are gone as Normal priority.
Mon, Nov 11, 10:37 AM · Operations, Traffic
ema created T237932: Remove debug proxies once all Varnish backends are gone.
Mon, Nov 11, 10:37 AM · Operations, Traffic
ema closed T237687: ATS doesn't support X-Wikimedia-Debug as Resolved.

The functionality is now deployed to production, a brief illustration follows.

Mon, Nov 11, 10:30 AM · Performance-Team (Radar), Operations, Traffic
ema created P9585 (An Untitled Masterwork).
Mon, Nov 11, 10:08 AM
ori awarded T236102: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) a Love token.
Mon, Nov 11, 2:28 AM · Operations, Traffic, Performance-Team

Fri, Nov 8

ema added a comment to T214734: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002).

Note that debug servers aren't pooled in etcd like regular production ones, so mwdebug1002 is still serving debug traffic:

Fri, Nov 8, 5:01 PM · Release-Engineering-Team, Performance-Team (Radar), serviceops, Wikimedia-production-error, User-fgiunchedi, Operations
ema committed rLPRI0c80b7552776: Add dummy globalsign-2019 keys for pcc (authored by ema).
Add dummy globalsign-2019 keys for pcc
Fri, Nov 8, 1:53 PM
ema moved T237492: Create a second text-lb IP address for test purposes from Triage to LoadBalancer on the Traffic board.
Fri, Nov 8, 11:23 AM · Traffic, Operations
ema moved T237608: ATS skipping certain logs due to lack of buffer space from Triage to Caching on the Traffic board.
Fri, Nov 8, 11:23 AM · Operations, Traffic
ema moved T237687: ATS doesn't support X-Wikimedia-Debug from Triage to Caching on the Traffic board.
Fri, Nov 8, 11:23 AM · Performance-Team (Radar), Operations, Traffic

Thu, Nov 7

ema triaged T237608: ATS skipping certain logs due to lack of buffer space as Normal priority.
Thu, Nov 7, 8:40 AM · Operations, Traffic
ema created T237608: ATS skipping certain logs due to lack of buffer space.
Thu, Nov 7, 8:40 AM · Operations, Traffic

Wed, Nov 6

ema awarded T237424: stunnel-wrap all rsync::server usage a Baby Tequila token.
Wed, Nov 6, 10:40 AM · Operations

Tue, Nov 5

ema moved T237117: Update webrequest_128 dataset in turnilo to include TLS fields once available from Triage to TLS on the Traffic board.
Tue, Nov 5, 3:36 PM · Analytics-Kanban, observability, Operations, Analytics, Traffic
ema moved T236988: ats-be on the text cluster is experiencing broken connections from Triage to Caching on the Traffic board.
Tue, Nov 5, 3:36 PM · Operations, Traffic
ema moved T237243: Network unreachable after network-online.target is brought up from Triage to General on the Traffic board.
Tue, Nov 5, 3:36 PM · netops, Operations, Traffic
ema moved T237360: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs from Triage to Caching on the Traffic board.
Tue, Nov 5, 3:35 PM · Traffic, Operations
ema moved T237425: ats-tls-restart failed on cp4027 from Triage to TLS on the Traffic board.
Tue, Nov 5, 3:35 PM · Operations, Traffic
ema added a project to T237425: ats-tls-restart failed on cp4027: Traffic.
Tue, Nov 5, 3:35 PM · Operations, Traffic
ema triaged T237425: ats-tls-restart failed on cp4027 as Normal priority.
Tue, Nov 5, 3:35 PM · Operations, Traffic
ema created T237425: ats-tls-restart failed on cp4027.
Tue, Nov 5, 3:35 PM · Operations, Traffic
ema updated the task description for T227432: Replace Varnish backends with ATS on cache text nodes.
Tue, Nov 5, 11:00 AM · Patch-For-Review, Traffic, Operations
ema added a comment to T237360: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs.

This time, after reimaging the host it did boot properly. Also, initramfs size is now in line with that of other cp5 systems:

Tue, Nov 5, 9:56 AM · Traffic, Operations
ema added a comment to T237360: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs.

As an update, cp5012 is currently reimaging (Started first puppet run phase). The initramfs looks like this right now:

Tue, Nov 5, 9:43 AM · Traffic, Operations
ema triaged T237360: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs as Normal priority.
Tue, Nov 5, 9:24 AM · Traffic, Operations
ema updated the task description for T237360: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs.
Tue, Nov 5, 8:24 AM · Traffic, Operations
ema created T237360: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs.
Tue, Nov 5, 8:22 AM · Traffic, Operations

Mon, Nov 4

ema added a project to T237243: Network unreachable after network-online.target is brought up: netops.
Mon, Nov 4, 1:02 PM · netops, Operations, Traffic
ema created T237243: Network unreachable after network-online.target is brought up.
Mon, Nov 4, 11:18 AM · netops, Operations, Traffic
ema created T237236: Important nagios-nrpe-server errors not showing up in unit journal.
Mon, Nov 4, 10:42 AM · Operations

Wed, Oct 30

ema moved T236240: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly from Triage to Watching on the Traffic board.
Wed, Oct 30, 2:43 PM · Traffic, Operations, Thumbor, MediaWiki-File-management, Commons, Multimedia
ema moved T236744: track NIC firmware version numbers across the fleet from Triage to Watching on the Traffic board.
Wed, Oct 30, 2:43 PM · Patch-For-Review, Operations, Traffic
ema moved T236755: Enforce POST size limit on ats-tls from Triage to TLS on the Traffic board.
Wed, Oct 30, 2:42 PM · Traffic, Operations
ema added a comment to T236684: sre.hosts.downtime fails with "No hosts provided".

I've just observed the issue again with cp5008:

Wed, Oct 30, 12:31 PM · User-jbond, Patch-For-Review, SRE-tools, Operations

Tue, Oct 29

ema closed T236500: large number of 504 errors from ulsfo as Resolved.

It is done, yes. Thanks @Ottomata!

Tue, Oct 29, 2:45 PM · Security, ops-ulsfo, Wikidata, Traffic, Wikidata-Query-Service, Operations
ema created P9495 Source of 20190102 wikitech:File:Infrastructure_Overview.png for draw.io.
Tue, Oct 29, 2:26 PM
ema moved T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests from Triage to Caching on the Traffic board.
Tue, Oct 29, 10:48 AM · Operations, Traffic
ema renamed T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests from varnish-fe is flooding the text backend caching layer with backend probe requests to Discarded VCL files stuck in auto/busy state cause high number of backend probe requests.
Tue, Oct 29, 10:48 AM · Operations, Traffic
ema triaged T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests as Normal priority.
Tue, Oct 29, 10:47 AM · Operations, Traffic
ema added a comment to T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests.

On cache_text we have a fairly significant number of VCL files stuck in the "auto/busy" state after having been discarded by our reload script. As an example, right now we have 10 VCLs in that state on cp3050 (text), and only 2 on cp3057 (upload). They can be seen with varnishadm -n frontend vcl.list. Each VCL file keeps running all its probes, causing the requests mentioned in this ticket. The issue seems to be known upstream, but the report "timed out": https://github.com/varnishcache/varnish-cache/issues/2228
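
A minimal sketch of how the stuck VCLs could be counted from the vcl.list output follows. It assumes the state appears as a whitespace-separated "auto/busy" token on each line; the exact column layout differs between Varnish versions. The function names are hypothetical.

```python
import subprocess


def busy_vcl_lines(vcl_list_output):
    """Return the lines of `vcl.list` output whose state is auto/busy."""
    busy = []
    for line in vcl_list_output.splitlines():
        # Assumption: the state shows up as its own whitespace-separated
        # token, regardless of which column it sits in.
        if "auto/busy" in line.split():
            busy.append(line.strip())
    return busy


def fetch_vcl_list(instance="frontend"):
    """Run `varnishadm -n <instance> vcl.list` and return its stdout."""
    result = subprocess.run(
        ["varnishadm", "-n", instance, "vcl.list"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

On a cache host, len(busy_vcl_lines(fetch_vcl_list())) would give the number of VCLs in that state for the frontend instance, which makes it easy to compare hosts such as cp3050 and cp3057.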

Tue, Oct 29, 10:46 AM · Operations, Traffic

Mon, Oct 28

ema moved T235736: cp3032 and cp3040 occasional failed fetches from Triage to Caching on the Traffic board.
Mon, Oct 28, 4:00 PM · Operations, Traffic
ema moved T235779: Implement basic routing for rest.php from Triage to Watching on the Traffic board.
Mon, Oct 28, 3:59 PM · Operations, Traffic, Patch-For-Review, CPT Initiatives (Core REST API in PHP), Core Platform Team Workboards (Green)
ema moved T236120: Get rid of nginx puppetization for cache upload from Triage to TLS on the Traffic board.
Mon, Oct 28, 3:59 PM · Traffic, Operations
ema moved T236125: Trigger envoy reload upon TLS certificate update from Triage to TLS on the Traffic board.
Mon, Oct 28, 3:58 PM · Operations, Traffic