ema (Emanuele Rocca)
Senior Site Reliability Engineer, Traffic Team

User Details

User Since
Sep 29 2015, 8:49 PM (226 w, 14 h)
Availability
Available
IRC Nick
ema
LDAP User
Ema
MediaWiki User
Unknown

Recent Activity

Thu, Jan 23

ema created P10251 (An Untitled Masterwork).
Thu, Jan 23, 3:39 PM

Wed, Jan 22

ema closed T242411: varnish parent unable to send signals to child as Resolved.

Capability added, all frontends restarted. Closing.

Wed, Jan 22, 4:12 PM · Patch-For-Review, Operations, Traffic
ema closed T242417: varnish-fe crashes due to "Error in munmap(): Cannot allocate memory" as Resolved.

Raised vm.max_map_count and added an icinga check that alerts when the number of memory map areas used by varnish approaches that limit. Closing.

Wed, Jan 22, 12:00 PM · Patch-For-Review, Operations, Traffic
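
The kind of check described in the T242417 closing comment can be sketched roughly as follows. This is a minimal illustration and not the actual icinga plugin: the function names and the 80%/90% thresholds are assumptions. It relies only on the fact that each line of /proc/<pid>/maps corresponds to one memory map area and that the kernel limit is exposed in /proc/sys/vm/max_map_count.

```python
from pathlib import Path

# Nagios/icinga-style exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def count_map_areas(pid):
    """Count memory map areas of a process (one line each in /proc/<pid>/maps)."""
    with open(f"/proc/{pid}/maps") as maps:
        return sum(1 for _ in maps)


def read_max_map_count():
    """Read the current value of the vm.max_map_count sysctl."""
    return int(Path("/proc/sys/vm/max_map_count").read_text())


def map_usage_status(used, limit, warn=0.8, crit=0.9):
    """Map the used/limit ratio to a status code (thresholds are hypothetical)."""
    ratio = used / limit
    if ratio >= crit:
        return CRITICAL
    if ratio >= warn:
        return WARNING
    return OK
```

A caller would pass the varnish child PID to count_map_areas() and feed the result, together with read_max_map_count(), into map_usage_status().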

Tue, Jan 21

ema created P10235 (An Untitled Masterwork).
Tue, Jan 21, 3:34 PM
ema closed T242579: Setup netconsole on upload@esams hosts, a subtask of T238305: servers freeze across the caching cluster, as Resolved.
Tue, Jan 21, 3:31 PM · Operations, Traffic
ema closed T242579: Setup netconsole on upload@esams hosts as Resolved.

This is now done in prod. All upload@esams nodes are sending their kernel messages to a central host. See journalctl -u netconsole on ganeti3002.

Tue, Jan 21, 3:31 PM · Traffic, Operations

Fri, Jan 17

ema created P10206 (An Untitled Masterwork).
Fri, Jan 17, 2:52 PM

Thu, Jan 16

ema added a comment to T183146: Monitor resource usage on a per-cgroup basis.

I have enabled cpu, memory, and blockio cgroups accounting on cp4026 (Jan 16 09:19:59) and cp4027 (Jan 16 09:37:00).

Thu, Jan 16, 9:58 AM · Operations, observability
ema updated the task description for T242952: traffic_server crash upon Lua reload: attempt to concatenate a table value.
Thu, Jan 16, 9:02 AM · Operations, Traffic
ema moved T242952: traffic_server crash upon Lua reload: attempt to concatenate a table value from Triage to Caching on the Traffic board.
Thu, Jan 16, 8:54 AM · Operations, Traffic
ema triaged T242952: traffic_server crash upon Lua reload: attempt to concatenate a table value as Medium priority.
Thu, Jan 16, 8:53 AM · Operations, Traffic
ema created T242952: traffic_server crash upon Lua reload: attempt to concatenate a table value.
Thu, Jan 16, 8:53 AM · Operations, Traffic

Wed, Jan 15

ema moved T227108: Port varnishlog consumers to log to syslog / logging infra from Triage to Caching on the Traffic board.
Wed, Jan 15, 12:29 PM · Traffic, Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Operations
ema moved T242778: ATS strict round robin parent select policy doesn't work as expected from Triage to TLS on the Traffic board.
Wed, Jan 15, 12:29 PM · Operations, Traffic

Tue, Jan 14

ema added a project to T227108: Port varnishlog consumers to log to syslog / logging infra: Traffic.
Tue, Jan 14, 3:21 PM · Traffic, Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Operations
ema triaged T242620: ats-tls is having issues when varnish-fe goes away as High priority.
Tue, Jan 14, 11:11 AM · Patch-For-Review, Operations, Traffic

Mon, Jan 13

ema added a comment to T242478: Production load.php spends ~ 10% time doing output compression within PHP.

I recall that in the pre-ATS setup, we explicitly configured the interaction between applayer and traffic to not request compressed responses.

Mon, Jan 13, 1:28 PM · Patch-For-Review, Operations, Traffic, Performance-Team
ema moved T242579: Setup netconsole on upload@esams hosts from Triage to Hardware on the Traffic board.
Mon, Jan 13, 1:12 PM · Traffic, Operations
ema moved T241944: Register wikipersonas.org and redirect URL from Triage to DNS Names on the Traffic board.
Mon, Jan 13, 1:11 PM · Patch-For-Review, Design-Research, Operations, Domains, Traffic
ema moved T242250: rack/setup/install ps[12]-60[34]-eqsin from Triage to Hardware on the Traffic board.
Mon, Jan 13, 1:11 PM · Operations, ops-eqsin, Traffic
ema moved T241309: Add more detailed instructions to the "sec-advice" page from Triage to TLS on the Traffic board.
Mon, Jan 13, 1:10 PM · Traffic, Operations
ema moved T242478: Production load.php spends ~ 10% time doing output compression within PHP from Triage to Caching on the Traffic board.
Mon, Jan 13, 1:10 PM · Patch-For-Review, Operations, Traffic, Performance-Team
ema updated the task description for T242478: Production load.php spends ~ 10% time doing output compression within PHP.
Mon, Jan 13, 1:07 PM · Patch-For-Review, Operations, Traffic, Performance-Team
ema triaged T242579: Setup netconsole on upload@esams hosts as Medium priority.
Mon, Jan 13, 10:33 AM · Traffic, Operations
ema created T242579: Setup netconsole on upload@esams hosts.
Mon, Jan 13, 10:33 AM · Traffic, Operations

Fri, Jan 10

ema moved T242417: varnish-fe crashes due to "Error in munmap(): Cannot allocate memory" from Triage to Caching on the Traffic board.
Fri, Jan 10, 2:41 PM · Patch-For-Review, Operations, Traffic
ema moved T242411: varnish parent unable to send signals to child from TLS to Caching on the Traffic board.
Fri, Jan 10, 2:41 PM · Patch-For-Review, Operations, Traffic
ema moved T242411: varnish parent unable to send signals to child from Triage to TLS on the Traffic board.
Fri, Jan 10, 2:41 PM · Patch-For-Review, Operations, Traffic
ema triaged T242411: varnish parent unable to send signals to child as Medium priority.
Fri, Jan 10, 2:41 PM · Patch-For-Review, Operations, Traffic
ema updated the task description for T242417: varnish-fe crashes due to "Error in munmap(): Cannot allocate memory".
Fri, Jan 10, 11:33 AM · Patch-For-Review, Operations, Traffic
ema triaged T242417: varnish-fe crashes due to "Error in munmap(): Cannot allocate memory" as High priority.
Fri, Jan 10, 11:31 AM · Patch-For-Review, Operations, Traffic
ema created T242417: varnish-fe crashes due to "Error in munmap(): Cannot allocate memory".
Fri, Jan 10, 11:31 AM · Patch-For-Review, Operations, Traffic
ema added a comment to T224567: decom debug proxies (was: Migrate debug proxies to Stretch/Buster).

Can you confirm there's no further pending work or tests that would make the debug proxies necessary again? Then I'd drop them from our environment.

Fri, Jan 10, 9:52 AM · serviceops, Operations
ema created T242411: varnish parent unable to send signals to child.
Fri, Jan 10, 9:35 AM · Patch-For-Review, Operations, Traffic
ema moved T234997: Make Netbox Active/Active from Triage to Caching on the Traffic board.
Fri, Jan 10, 8:22 AM · Patch-For-Review, Traffic, Operations
ema moved T241656: sec-warning page uses the term "Wikipedia" incorrectly from Triage to TLS on the Traffic board.
Fri, Jan 10, 8:21 AM · Voice & Tone, Operations, HTTPS, Traffic
ema moved T237165: LDF server has 404 errors for JS and CSS resources from Triage to Caching on the Traffic board.
Fri, Jan 10, 8:21 AM · Discovery-Search (Current work), Operations, Traffic, Wikidata, Wikidata-Query-Service, Discovery

Thu, Jan 9

ema added a comment to T233474: Ensure graphs used by Performance account for Varnish-to-ATS migration.

@ema If I understand correctly, varnishrls does not yet require migration because it's logged by varnish-fe instead of the (now migrated to ATS) varnish-be. Is that correct?

Thu, Jan 9, 10:29 AM · Traffic, Performance-Team, Operations, observability

Wed, Jan 8

ema raised the priority of T183146: Monitor resource usage on a per-cgroup basis from Medium to High.
Wed, Jan 8, 4:45 PM · Operations, observability
ema added a comment to T238305: servers freeze across the caching cluster.

Sometimes, when the iDRAC version is not up to date, we might not see any log at system crash.

Wed, Jan 8, 7:41 AM · Operations, Traffic

Tue, Jan 7

ema updated the task description for T237993: Create replacement for Varnishkafka.
Tue, Jan 7, 10:33 AM · Patch-For-Review, Traffic, Operations, Analytics

Mon, Jan 6

ema updated the task description for T238305: servers freeze across the caching cluster.
Mon, Jan 6, 1:24 PM · Operations, Traffic
ema added a project to T236561: "traffic" Cloud VPS project jessie deprecation: Traffic.
Mon, Jan 6, 8:54 AM · Operations, Traffic, Cloud-VPS (Debian Jessie Deprecation)
ema closed T236561: "traffic" Cloud VPS project jessie deprecation as Resolved.

@ayounsi You marked this project as "in use" in the 2019 project purge. Can you provide an estimate of when you or others may be able to address this issue?

This is just a matter of reimaging traffic-puppetmaster.traffic.eqiad.wmflabs with buster, correct? If so, I think we should be able to do this soonish, likely before the end of January.

Mon, Jan 6, 8:53 AM · Operations, Traffic, Cloud-VPS (Debian Jessie Deprecation)

Fri, Jan 3

ema closed T241653: two failing upload VTC tests as Resolved.
[*] Finding cluster...
        cp3051.esams.wmnet is a cache_upload host
Fri, Jan 3, 11:46 AM · Operations, Traffic
ema updated subscribers of T241421: Sustained periods (2-4h) of bad latency on production-search eqiad.

I believe this is caused by a bot sending a large amount of requests of type:
/w/api.php?format=json&action=query&prop=revisions&list=search&srsearch=search+query
using the UA: wikipedia (https://github.com/goldsmith/Wikipedia/)

Fri, Jan 3, 11:08 AM · Discovery-Search (Current work), Patch-For-Review, Operations, Traffic, Performance Issue, Elasticsearch
ema moved T241421: Sustained periods (2-4h) of bad latency on production-search eqiad from Triage to Caching on the Traffic board.
Fri, Jan 3, 10:05 AM · Discovery-Search (Current work), Patch-For-Review, Operations, Traffic, Performance Issue, Elasticsearch
ema triaged T241421: Sustained periods (2-4h) of bad latency on production-search eqiad as High priority.
Fri, Jan 3, 10:04 AM · Discovery-Search (Current work), Patch-For-Review, Operations, Traffic, Performance Issue, Elasticsearch

Thu, Jan 2

ema moved T241653: two failing upload VTC tests from Triage to Caching on the Traffic board.
Thu, Jan 2, 2:46 PM · Operations, Traffic
ema triaged T241653: two failing upload VTC tests as Medium priority.
Thu, Jan 2, 2:46 PM · Operations, Traffic
ema committed rLPRId57e5e11ad2d: Rename cloud_nets to public_cloud_nets (authored by ema).
Rename cloud_nets to public_cloud_nets
Thu, Jan 2, 9:20 AM
ema added a comment to T233474: Ensure graphs used by Performance account for Varnish-to-ATS migration.
Thu, Jan 2, 8:54 AM · Traffic, Performance-Team, Operations, observability
ema added a comment to T236561: "traffic" Cloud VPS project jessie deprecation.

@ayounsi You marked this project as "in use" in the 2019 project purge. Can you provide an estimate of when you or others may be able to address this issue?

Thu, Jan 2, 8:14 AM · Operations, Traffic, Cloud-VPS (Debian Jessie Deprecation)

Tue, Dec 31

ema moved T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory from Triage to Caching on the Traffic board.
Tue, Dec 31, 8:28 AM · observability, Operations, Traffic
ema triaged T241109: wikibugs needs restart almost everyday as Medium priority.
Tue, Dec 31, 8:06 AM · Operations, Wikibugs
ema added a comment to T241109: wikibugs needs restart almost everyday.

FTR, on 2019-12-29 I also restarted wikibugs due to phab comments not showing up on irc. I did not check the logs at the time.

Tue, Dec 31, 8:06 AM · Operations, Wikibugs

Mon, Dec 30

ema triaged T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory as High priority.
Mon, Dec 30, 3:58 PM · observability, Operations, Traffic
ema created T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory.
Mon, Dec 30, 3:58 PM · observability, Operations, Traffic
ema committed rLPRI91b093272219: varnish: dummy acl cloud_nets (authored by ema).
varnish: dummy acl cloud_nets
Mon, Dec 30, 2:42 PM

Dec 30 2019

ema updated the title for P10016 vcl_aws_nets.py from aws_ips.py to vcl_aws_nets.py.
Dec 30 2019, 9:10 AM
ema created P10016 vcl_aws_nets.py.
Dec 30 2019, 8:59 AM

Dec 29 2019

ema updated the task description for T238305: servers freeze across the caching cluster.
Dec 29 2019, 11:14 AM · Operations, Traffic
ema added a comment to T238305: servers freeze across the caching cluster.

cp3061 crashed today, yet another cache_upload node in esams, continuing the trend mentioned in T241306#5759233. DC-Ops: is there anything you can think of that differentiates esams upload hosts, cp30(5[13579]|6[135]), from text cp30(5[02468]|6[024])? An obvious one is network utilization, significantly higher on upload hosts, but maybe there's something else hardware-related that we're overlooking?

Dec 29 2019, 11:08 AM · Operations, Traffic
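
The two host patterns quoted in the T238305 comment above can be applied mechanically to tell upload and text hosts apart. The sketch below only wraps those two regular expressions; the function name and the handling of fully qualified names are assumptions made for illustration.

```python
import re

# Patterns copied from the comment: odd-numbered esams hosts are upload,
# even-numbered ones are text.
UPLOAD_RE = re.compile(r"cp30(5[13579]|6[135])")
TEXT_RE = re.compile(r"cp30(5[02468]|6[024])")


def esams_cluster(hostname):
    """Return 'upload', 'text', or None for hosts matching neither pattern."""
    short = hostname.split(".", 1)[0]  # strip the domain part, if any
    if UPLOAD_RE.fullmatch(short):
        return "upload"
    if TEXT_RE.fullmatch(short):
        return "text"
    return None
```

For instance, esams_cluster("cp3051.esams.wmnet") yields "upload", consistent with the cluster lookup output shown elsewhere in this feed.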
ema updated the task description for T238305: servers freeze across the caching cluster.
Dec 29 2019, 10:55 AM · Operations, Traffic
ema moved T233474: Ensure graphs used by Performance account for Varnish-to-ATS migration from Triage to Caching on the Traffic board.
Dec 29 2019, 10:53 AM · Traffic, Performance-Team, Operations, observability

Dec 23 2019

ema created P10011 puppetvagrant.diff.
Dec 23 2019, 2:12 PM
ema moved T241306: cp3051 crashed from Triage to Hardware on the Traffic board.
Dec 23 2019, 9:13 AM · Traffic, Operations
ema added a comment to T241232: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text.

With the migration to ATS we have moved the URI Path Normalization implementation from native (VCL -> C) to Lua code, so we thought this might have had an impact when it comes to this ticket. I've tried disabling normalize-path.lua altogether on cp2023 by removing it from all remap rules. There seems to be no significant change in CPU usage compared to another text@codfw host, and this would indicate that there is no point in spending much time looking at Lua-level optimizations, at least when it comes to normalize-path.lua itself.

Dec 23 2019, 9:08 AM · Traffic, Operations

Dec 22 2019

ema added a comment to T241306: cp3051 crashed.

Thanks @Volans for taking care of this.

Nothing in racadm, checked both getsel and lclog view. Nothing in syslog & co.

Dec 22 2019, 10:21 AM · Traffic, Operations

Dec 20 2019

ema moved T241239: Cleanup after varnish-be -> ats-be migration from Triage to Caching on the Traffic board.
Dec 20 2019, 1:51 PM · Patch-For-Review, Operations, Traffic
ema triaged T241239: Cleanup after varnish-be -> ats-be migration as Medium priority.
Dec 20 2019, 1:51 PM · Patch-For-Review, Operations, Traffic
ema created T241239: Cleanup after varnish-be -> ats-be migration.
Dec 20 2019, 1:51 PM · Patch-For-Review, Operations, Traffic
ema moved T241233: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers from Triage to Caching on the Traffic board.
Dec 20 2019, 1:29 PM · Operations, Traffic
ema triaged T241233: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers as Medium priority.
Dec 20 2019, 1:24 PM · Operations, Traffic
ema renamed T241233: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers from ats-be: consider moving accept from dedicated thread to workers to ats-be: consider increasing accept threads or moving accept from dedicated thread to workers.
Dec 20 2019, 1:24 PM · Operations, Traffic
ema created T241233: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers.
Dec 20 2019, 1:23 PM · Operations, Traffic
ema triaged T241232: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text as Medium priority.
Dec 20 2019, 12:54 PM · Traffic, Operations
ema moved T241232: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text from Triage to Caching on the Traffic board.
Dec 20 2019, 12:54 PM · Traffic, Operations
ema created T241232: High CPU usage for ats-be ET_NET thread handling PURGE requests on cache_text.
Dec 20 2019, 12:53 PM · Traffic, Operations
ema triaged T240446: Donate wikiźródła.pl and wikisłownik.pl to the Foundation as Medium priority.
Dec 20 2019, 12:37 PM · Patch-For-Review, Domains, Traffic, DNS, Operations
ema triaged T240813: HTTPS/Browser Recommendations page on Wikitech is outdated as Medium priority.
Dec 20 2019, 12:36 PM · Operations, Traffic
ema moved T218308: Add gerrit.wikimedia.org to the Phabricator CSP from Triage to Watching on the Traffic board.
Dec 20 2019, 12:35 PM · ContentSecurityPolicy, Traffic, Security-Team, Operations, Phabricator, Gerrit
ema moved T240863: Secure shared ticket key rotation for anycast authdns from Triage to DNS Infra on the Traffic board.
Dec 20 2019, 12:34 PM · Operations, Traffic
ema triaged T240863: Secure shared ticket key rotation for anycast authdns as Medium priority.
Dec 20 2019, 12:34 PM · Operations, Traffic
ema moved T240866: Create a system for distributed shared secret material to server tmps from Triage to General on the Traffic board.
Dec 20 2019, 12:34 PM · Operations, Traffic
ema triaged T240866: Create a system for distributed shared secret material to server tmps as Medium priority.
Dec 20 2019, 12:34 PM · Operations, Traffic
ema moved T241132: wikimedia.community domain name is not resolving an mx record from Triage to DNS Names on the Traffic board.
Dec 20 2019, 12:33 PM · Mail, Traffic, Operations, DNS
ema moved T196558: Send X-Analytics information from Varnish to Hadoop with VCL_Log from Triage to Caching on the Traffic board.
Dec 20 2019, 12:33 PM · Traffic, Operations, Performance-Team (Radar), Analytics
ema triaged T241132: wikimedia.community domain name is not resolving an mx record as Medium priority.
Dec 20 2019, 12:33 PM · Mail, Traffic, Operations, DNS
ema triaged T241145: Improve ATS backend connection reuse against origin servers as Medium priority.
Dec 20 2019, 12:33 PM · Performance-Team (Radar), Patch-For-Review, Traffic, Operations
ema added a project to T196558: Send X-Analytics information from Varnish to Hadoop with VCL_Log: Traffic.
Dec 20 2019, 9:51 AM · Traffic, Operations, Performance-Team (Radar), Analytics
ema added a comment to T196558: Send X-Analytics information from Varnish to Hadoop with VCL_Log.

Since the migration work is starting in Q3, it sounds like this task would be obsolete at some point in 2020, right? I guess as part of the migration it would be nice to make it a requirement to get rid of those headers being sent to end users, granted that ATS is capable of doing that.

Dec 20 2019, 9:48 AM · Traffic, Operations, Performance-Team (Radar), Analytics

Dec 19 2019

ema awarded Blog Post: The journey to Prometheus 2 a Burninate token.
Dec 19 2019, 3:26 PM
ema updated subscribers of T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS).

There are thus two fronts to work on now: (1) increase connection reuse, and (2) decrease the cost of establishing a new connection. There is obvious low-hanging fruit for (2): we're currently using weighted round-robin as the load balancing policy for appservers.discovery.wmnet. Instead of that, we should switch to consistent hashing based on the client (ats-be) IP, enable TCP Fast Open, and make sure we're reusing TLS sessions appropriately.

Dec 19 2019, 2:29 PM · Wikimedia-Incident, Patch-For-Review, Performance-Team, Traffic, Operations
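
To illustrate why consistent hashing on the client IP helps connection reuse, here is a minimal hash ring sketch. It is a generic illustration of the technique, not how the load balancer in front of appservers.discovery.wmnet implements it; the class name, the backend names used with it, and the choice of 100 virtual nodes per backend are all assumptions.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring: each backend owns several points on the ring."""

    def __init__(self, backends, vnodes=100):
        self._ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def backend_for(self, client_ip):
        """Pick the first ring point after the client's hash, wrapping around."""
        idx = bisect.bisect(self._points, _hash(client_ip)) % len(self._ring)
        return self._ring[idx][1]
```

Because a given client IP always lands on the same backend, repeated requests from one ats-be host go to the same origin server, which makes existing connections more likely to be reusable. Removing a backend only remaps the clients that were assigned to it.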
ema moved T240950: Write side of ats-tls named pipe deleted upon logging config change reload from Triage to TLS on the Traffic board.
Dec 19 2019, 2:17 PM · Operations, Traffic
ema moved T241084: nameserver change for wikimedia.sk from Triage to DNS Names on the Traffic board.
Dec 19 2019, 2:17 PM · Operations, Domains, Traffic
ema moved T241001: cp3050 depooled due to explosion in CPU usage and inuse sockets from Triage to Caching on the Traffic board.
Dec 19 2019, 2:17 PM · Wikimedia-Incident, Traffic, Operations
ema moved T241145: Improve ATS backend connection reuse against origin servers from Triage to Caching on the Traffic board.
Dec 19 2019, 2:17 PM · Performance-Team (Radar), Patch-For-Review, Traffic, Operations
ema closed T238817: Request routing to active/passive services active in codfw only stopped working as Resolved.

Having finished the transition to ATS (T227432), there is no routing between cache backends anymore.

Dec 19 2019, 2:16 PM · Operations, Traffic
ema closed T227432: Replace Varnish backends with ATS on cache text nodes as Resolved.

cp2023 and cp1089 were the last two hosts running Varnish as backend cache. We now have exclusively ats-be across the fleet!

Dec 19 2019, 2:12 PM · Patch-For-Review, Operations, Traffic