ema (Emanuele Rocca)
WMF Operations Engineer

User Details

User Since
Sep 29 2015, 8:49 PM (142 w, 1 d)
Availability
Available
IRC Nick
ema
LDAP User
Ema
MediaWiki User
Unknown

Recent Activity

Fri, Jun 15

ema updated the task description for T164609: Merge cache_misc into cache_text functionally.
Fri, Jun 15, 7:31 AM · Patch-For-Review, Traffic, Operations

Thu, Jun 14

ema added a comment to T196974: cp3037 is currently unreachable.

The host and its management interface are back online.

Thu, Jun 14, 4:20 PM · Traffic, Operations, ops-esams
ema edited P7258 AES128-SHA-deprecation.vtc.
Thu, Jun 14, 3:04 PM
ema created P7258 AES128-SHA-deprecation.vtc.
Thu, Jun 14, 3:04 PM

Wed, Jun 13

ema edited P7253 labs-text-node.yaml.
Wed, Jun 13, 9:53 AM
ema edited P7253 labs-text-node.yaml.
Wed, Jun 13, 9:50 AM
ema edited P7253 labs-text-node.yaml.
Wed, Jun 13, 9:32 AM
ema edited P7253 labs-text-node.yaml.
Wed, Jun 13, 9:31 AM
ema created P7253 labs-text-node.yaml.
Wed, Jun 13, 9:25 AM

Mon, Jun 11

ema closed T196355: Package libvmod-re2, a subtask of T164609: Merge cache_misc into cache_text functionally, as Resolved.
Mon, Jun 11, 8:50 AM · Patch-For-Review, Traffic, Operations
ema closed T196355: Package libvmod-re2 as Resolved.

libvmod-re2 is now available on apt.w.o, closing.

Mon, Jun 11, 8:50 AM · Patch-For-Review, Traffic, Operations

Mon, Jun 4

ema updated subscribers of T164609: Merge cache_misc into cache_text functionally.
Mon, Jun 4, 4:31 PM · Patch-For-Review, Traffic, Operations
ema moved T196030: troubleshoot cr3/cr4 link from Triage to Network on the Traffic board.
Mon, Jun 4, 3:38 PM · Operations, ops-ulsfo, netops, Traffic
ema moved T196355: Package libvmod-re2 from Triage to Caching on the Traffic board.
Mon, Jun 4, 11:33 AM · Patch-For-Review, Traffic, Operations
ema triaged T196355: Package libvmod-re2 as Normal priority.
Mon, Jun 4, 11:33 AM · Patch-For-Review, Traffic, Operations
ema moved T196066: Add prometheus metrics for varnishkafka instances running on caching hosts from Triage to Caching on the Traffic board.
Mon, Jun 4, 9:09 AM · Traffic, Operations, Analytics
ema moved T196248: TLS certificates renewal process from Triage to TLS on the Traffic board.
Mon, Jun 4, 9:09 AM · Performance-Team (Radar), HTTPS, Traffic, Operations

Fri, Jun 1

ema updated the task description for T164609: Merge cache_misc into cache_text functionally.
Fri, Jun 1, 2:43 PM · Patch-For-Review, Traffic, Operations
ema created P7203 additional-vcl.vtc.
Fri, Jun 1, 2:20 PM
ema updated the task description for T164609: Merge cache_misc into cache_text functionally.
Fri, Jun 1, 1:46 PM · Patch-For-Review, Traffic, Operations
ema updated the task description for T164609: Merge cache_misc into cache_text functionally.
Fri, Jun 1, 1:44 PM · Patch-For-Review, Traffic, Operations
ema created P7201 cache-text-misc.vcl.
Fri, Jun 1, 1:19 PM

Thu, May 31

ema moved T195923: rack/setup/install cp1075-cp1090 from Triage to Hardware on the Traffic board.
Thu, May 31, 9:06 AM · ops-eqiad, Traffic, Operations

Wed, May 30

ema added a comment to T195981: require_package should mark packages as manually installed.

There's an open upstream bug for this too: https://tickets.puppetlabs.com/browse/PUP-6631

Wed, May 30, 4:08 PM · Operations, Puppet
ema updated the task description for T195981: require_package should mark packages as manually installed.
Wed, May 30, 3:12 PM · Operations, Puppet
ema triaged T195981: require_package should mark packages as manually installed as Normal priority.
Wed, May 30, 3:11 PM · Operations, Puppet
ema created T195981: require_package should mark packages as manually installed.
Wed, May 30, 3:11 PM · Operations, Puppet
ema added a comment to T127825: Re-add intel-microcode.

The following cache hosts have been running with updated microcodes for the past two days:

Wed, May 30, 2:28 PM · Patch-For-Review, Operations
ema added a comment to T127825: Re-add intel-microcode.

It would be useful to check if a new microcode is available (and thus a system restart is needed). Something along these lines should do the trick:

Wed, May 30, 1:37 PM · Patch-For-Review, Operations

Mon, May 28

ema added a comment to T184942: Deprecate python varnish cachestats.

varnishrls removed, thanks @Krinkle.

Mon, May 28, 11:34 AM · Patch-For-Review, Traffic, User-fgiunchedi, Goal, Operations
ema changed the status of T184942: Deprecate python varnish cachestats from Stalled to Open.
Mon, May 28, 11:33 AM · Patch-For-Review, Traffic, User-fgiunchedi, Goal, Operations
ema changed the status of T184942: Deprecate python varnish cachestats, a subtask of T177199: Add Prometheus client support for varnish/statsd metrics daemons, from Stalled to Open.
Mon, May 28, 11:33 AM · Patch-For-Review, Traffic, User-fgiunchedi, Goal, Operations
ema moved T195568: HTTP 404 on stats.wikipedia.org (Domain not served) from Triage to Watching on the Traffic board.
Mon, May 28, 9:49 AM · Analytics, Analytics-Wikistats, Operations, Traffic, Domains
ema moved T195365: cp intermittent IPsec MTU issue from Triage to Network on the Traffic board.
Mon, May 28, 9:47 AM · netops, Traffic, Operations

Wed, May 23

ema moved T195327: Normalise the Accept-Language header for REST API requests from Triage to Caching on the Traffic board.
Wed, May 23, 7:40 AM · Services (done), Traffic, RESTBase-API, Operations

May 22 2018

ema moved T194965: gdnsd plugin support for ACME DNS challenges from Triage to DNS Infra on the Traffic board.
May 22 2018, 8:32 AM · Traffic, Operations

May 17 2018

ema updated the task description for T194814: Reduce amount of headers sent from web responses.
May 17 2018, 2:17 PM · Performance-Team (Radar), Patch-For-Review, media-storage, Operations, Traffic

May 16 2018

ema moved T194814: Reduce amount of headers sent from web responses from Triage to Caching on the Traffic board.
May 16 2018, 10:19 AM · Performance-Team (Radar), Patch-For-Review, media-storage, Operations, Traffic
ema triaged T194814: Reduce amount of headers sent from web responses as Normal priority.
May 16 2018, 10:18 AM · Performance-Team (Radar), Patch-For-Review, media-storage, Operations, Traffic
ema created T194814: Reduce amount of headers sent from web responses.
May 16 2018, 10:18 AM · Performance-Team (Radar), Patch-For-Review, media-storage, Operations, Traffic
ema updated the task description for T192368: Unconditional return(deliver) in vcl_hit.
May 16 2018, 9:10 AM · Patch-For-Review, Operations, Traffic
ema moved T194724: Deprecate `base::service_unit` in puppet from Triage to General on the Traffic board.
May 16 2018, 8:41 AM · Patch-For-Review, cloud-services-team, User-Joe, Traffic, Cloud-Services, Operations, Puppet
ema triaged T194724: Deprecate `base::service_unit` in puppet as Normal priority.
May 16 2018, 8:41 AM · Patch-For-Review, cloud-services-team, User-Joe, Traffic, Cloud-Services, Operations, Puppet
ema triaged T194757: cp1068 memory correctable errors as Normal priority.
May 16 2018, 8:41 AM · ops-eqiad, Traffic, Operations
ema moved T194757: cp1068 memory correctable errors from Triage to Hardware on the Traffic board.
May 16 2018, 8:40 AM · ops-eqiad, Traffic, Operations

May 14 2018

ema moved T194380: Identify bots using AES128-SHA maintainers running on toolforge from Triage to TLS on the Traffic board.
May 14 2018, 9:01 AM · Operations, Traffic

May 7 2018

ema moved T192206: Remove wildcard vhost for *.wikimedia.org from Triage to Watching on the Traffic board.
May 7 2018, 9:49 AM · Patch-For-Review, Operations, Wikimedia-Apache-configuration, Traffic
ema moved T193521: Consider adding expect-CT: header to enforce certificate transparency from Triage to TLS on the Traffic board.
May 7 2018, 9:48 AM · Operations, Traffic
ema moved T193677: Interface errors on asw-d-codfw:xe-2/0/47 from Triage to Network on the Traffic board.
May 7 2018, 9:48 AM · ops-codfw, netops, Traffic, Operations
ema moved T193865: Enable numa_networking on all caches from Triage to General on the Traffic board.
May 7 2018, 9:48 AM · Patch-For-Review, Operations, Traffic
ema moved T193897: cr1-eqsin 4 onboard interfaces down from Triage to Network on the Traffic board.
May 7 2018, 9:47 AM · netops, Traffic, Operations

May 4 2018

ema triaged T193865: Enable numa_networking on all caches as Normal priority.
May 4 2018, 12:36 PM · Patch-For-Review, Operations, Traffic
ema created T193865: Enable numa_networking on all caches.
May 4 2018, 12:36 PM · Patch-For-Review, Operations, Traffic

May 2 2018

ema moved T193445: Update Media dashboard in Grafana to use Prometheus metrics from Triage to Caching on the Traffic board.
May 2 2018, 9:38 AM · Multimedia, Traffic, Operations
ema moved T193489: Refactor varnishospital and varnishslowlog from Triage to Caching on the Traffic board.
May 2 2018, 9:38 AM · Patch-For-Review, Operations, Performance-Team, Traffic
ema triaged T193445: Update Media dashboard in Grafana to use Prometheus metrics as Normal priority.
May 2 2018, 9:38 AM · Multimedia, Traffic, Operations

Apr 30 2018

ema added a comment to T184942: Deprecate python varnish cachestats.

@Krinkle I've pushed https://gerrit.wikimedia.org/r/429833 to remove varnishmedia, my understanding is that there's only one dashboard currently using statsd data under media.thumbnail.varnish. We do have prometheus data that can be used to replace it. Thoughts?

Apr 30 2018, 4:34 PM · Patch-For-Review, Traffic, User-fgiunchedi, Goal, Operations

Apr 25 2018

ema awarded T191236: Resolve elasticsearch latency alerts a Love token.
Apr 25 2018, 9:14 AM · Patch-For-Review, Discovery-Search (Current work)

Apr 24 2018

ema added a comment to T192368: Unconditional return(deliver) in vcl_hit.

As of now, returning deliver instead of miss is a valid mitigation for #1799. The main drawback, as mentioned in this ticket, is that we're potentially returning stale objects.

Apr 24 2018, 1:11 PM · Patch-For-Review, Operations, Traffic

Apr 23 2018

ema added a comment to T186069: Icinga: page in case all MediaWiki are throwing 5xx.

This alerted today for a real but short-lived issue. Note that there is a single alert even though its text can change over time (e.g. when more sites alert), so Icinga needs to be instructed to re-alert whenever the text changes. Other possible improvements include printing the "worst" value found among all metrics that match the query.

15:01:47 PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site={codfw,ulsfo} 
15:03:37 PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
15:03:47 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
15:04:18 PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
15:05:07 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
15:05:47 RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin)
15:11:38 RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:12:27 RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:13:07 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
15:13:48 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
Apr 23 2018, 3:34 PM · Patch-For-Review, Wikimedia-Incident, Icinga, Operations, monitoring

Apr 20 2018

ema added a comment to T185504: Netbox: add Icinga check for PostgreSQL.

We've had the following Icinga UNKNOWN on netmon2001 for the past 6 days:

Apr 20 2018, 7:03 AM · Patch-For-Review, monitoring, Operations

Apr 19 2018

ema reopened T191360: decom spare server lawrencium/WMF3542 as "Open".

Re-opening, this morning we had two icinga criticals for lawrencium and lawrencium.mgmt being down. Some decom steps seem to have been skipped.

Apr 19 2018, 8:05 AM · decommission, Operations, ops-eqiad
ema reopened T191360: decom spare server lawrencium/WMF3542, a subtask of T187473: Decommission old and unused/spare servers in eqiad, as Open.
Apr 19 2018, 8:05 AM · decommission, Operations, DC-Ops, ops-eqiad

Apr 18 2018

ema updated the task description for T192368: Unconditional return(deliver) in vcl_hit.
Apr 18 2018, 1:50 PM · Patch-For-Review, Operations, Traffic
ema triaged T192437: Pybal support of configuration from the kubernetes API as Normal priority.
Apr 18 2018, 1:31 PM · Patch-For-Review, Traffic, Operations, Prod-Kubernetes, Pybal
ema moved T191393: Puppet: tlsproxy localssl default_server make a Notify at each run from Triage to TLS on the Traffic board.
Apr 18 2018, 1:30 PM · Traffic, Operations, Puppet
ema moved T192368: Unconditional return(deliver) in vcl_hit from Triage to Caching on the Traffic board.
Apr 18 2018, 1:30 PM · Patch-For-Review, Operations, Traffic

Apr 17 2018

ema triaged T192368: Unconditional return(deliver) in vcl_hit as Normal priority.
Apr 17 2018, 3:12 PM · Patch-For-Review, Operations, Traffic
ema created T192368: Unconditional return(deliver) in vcl_hit.
Apr 17 2018, 3:11 PM · Patch-For-Review, Operations, Traffic
ema updated the title for P6970 vcl_hit-deliver-keep.vtc from untitled to vcl_hit-deliver-keep.vtc.
Apr 17 2018, 2:25 PM
ema moved T192280: sda failure in hydrogen.wikimedia.org from Triage to Hardware on the Traffic board.
Apr 17 2018, 1:25 PM · ops-eqiad, Traffic, Operations
ema created P7000 db1080 vs db1114 network interface stats.
Apr 17 2018, 9:41 AM
ema added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

Here's pageview hourly after deploying the changes above:

Apr 17 2018, 9:16 AM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers

Apr 16 2018

ema added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

+1, let me know when it is in place and I can help check that things are square again on my end.

Apr 16 2018, 3:05 PM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers

Apr 13 2018

ema added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

The Opera Mini stats issue here is definitely due to missing proxy information from zero portal. Here's what is currently being returned when calling zero.wikimedia.org's api with action=zeroportal and type=proxies:

Apr 13 2018, 3:00 PM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers
ema renamed T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country from Opera mini IP addresses reassigned to Proxies information gone from Zero portal.
Apr 13 2018, 2:57 PM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers
ema added a member for Traffic: Vgutierrez.
Apr 13 2018, 7:52 AM
ema moved T192087: Rename lvs* LLDP port descriptions after upgrading to stretch from Triage to Network on the Traffic board.
Apr 13 2018, 7:52 AM · netops, Traffic, Pybal, Operations
ema added a project to T192087: Rename lvs* LLDP port descriptions after upgrading to stretch: netops.
Apr 13 2018, 7:51 AM · netops, Traffic, Pybal, Operations
ema triaged T192082: lvs2006 Embedded Flash/SD-CARD iLO errors as Normal priority.
Apr 13 2018, 7:51 AM · Traffic, DC-Ops, ops-codfw, Operations

Apr 11 2018

ema moved T191940: Investigate 2018-04-10 global traffic drop from Triage to Network on the Traffic board.
Apr 11 2018, 12:44 PM · Patch-For-Review, Wikimedia-Incident, Traffic, Operations
ema triaged T191940: Investigate 2018-04-10 global traffic drop as High priority.
Apr 11 2018, 12:43 PM · Patch-For-Review, Wikimedia-Incident, Traffic, Operations
ema created P6981 (An Untitled Masterwork).
Apr 11 2018, 9:49 AM
ema created P6979 cp2022 varnish backend crash: Clock step detected.
Apr 11 2018, 7:44 AM
ema updated subscribers of T191956: Document how to fix IPMI issues on Wikitech.
Apr 11 2018, 7:31 AM · Operations, Documentation
ema renamed T191956: Document how to fix IPMI issues on Wikitech from Please document how to try fixing IPMI issues on Wikitech to Document how to fix IPMI issues on Wikitech.
Apr 11 2018, 7:30 AM · Operations, Documentation
ema created T191956: Document how to fix IPMI issues on Wikitech.
Apr 11 2018, 7:30 AM · Operations, Documentation

Apr 9 2018

ema added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

...ehhh wha? We used to collect XFF on the webrequest side, and then parse it to get ip. We removed this, because @BBlack implemented XFF parsing on the varnish side. Eh?

Apr 9 2018, 4:53 PM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers
ema edited P6970 vcl_hit-deliver-keep.vtc.
Apr 9 2018, 3:01 PM
ema edited P6970 vcl_hit-deliver-keep.vtc.
Apr 9 2018, 2:55 PM
ema created P6970 vcl_hit-deliver-keep.vtc.
Apr 9 2018, 2:43 PM
ema added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

@ema: on our end we just look at the ip passed along via varnishkafka to geolocate, not at XFF. See:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql#L88

Apr 9 2018, 8:49 AM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers

Apr 6 2018

ema added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

@mbaluta: note that the problem I've mentioned in my comment above is probably unrelated to the stats issue discussed here (would be good to fix it nonetheless!).

Apr 6 2018, 1:56 PM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers
ema added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

If you provided the IP address of our server, we could at least tell whether it is coming from users of Extreme (OBML) or High (Turbo) mode.

Apr 6 2018, 1:06 PM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers
ema added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

Varnish5 rollout might have something to do with this? https://gerrit.wikimedia.org/r/#/c/409047/ cc @ema

Apr 6 2018, 9:27 AM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers

Apr 5 2018

ema added a comment to T190090: Offload pings to dedicated server.

About kernel tuning, here are the variables we can adjust as necessary, with their defaults:

50 -- /proc/sys/net/ipv4/icmp_msgs_burst
1000 -- /proc/sys/net/ipv4/icmp_msgs_per_sec
1000 -- /proc/sys/net/ipv4/icmp_ratelimit
6168 -- /proc/sys/net/ipv4/icmp_ratemask
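
The icmp_ratemask default above is a bitmask indexed by ICMP type number, so it is easier to read once decoded. The following is a minimal sketch, not part of the original comment: the function names and the small type-name table are illustrative assumptions, and the listing is parsed from text rather than read from a live host.

```python
# Hypothetical helpers for reading the "value -- path" listing above.
# The names below are illustrative and do not come from the original comment.

# Names for a few ICMP type numbers (standard ICMPv4 assignments).
ICMP_TYPE_NAMES = {
    0: "Echo Reply",
    3: "Destination Unreachable",
    4: "Source Quench",
    8: "Echo Request",
    11: "Time Exceeded",
    12: "Parameter Problem",
}


def parse_listing(text):
    """Parse lines like '50 -- /proc/sys/net/ipv4/icmp_msgs_burst'
    into a dict keyed by the last path component."""
    settings = {}
    for line in text.strip().splitlines():
        value, _, path = line.partition(" -- ")
        name = path.strip().rsplit("/", 1)[-1]
        settings[name] = int(value)
    return settings


def rate_limited_types(ratemask):
    """Return the ICMP type numbers whose bit is set in the mask."""
    return [bit for bit in range(ratemask.bit_length()) if (ratemask >> bit) & 1]


listing = """
50 -- /proc/sys/net/ipv4/icmp_msgs_burst
1000 -- /proc/sys/net/ipv4/icmp_msgs_per_sec
1000 -- /proc/sys/net/ipv4/icmp_ratelimit
6168 -- /proc/sys/net/ipv4/icmp_ratemask
"""

settings = parse_listing(listing)
for icmp_type in rate_limited_types(settings["icmp_ratemask"]):
    print(icmp_type, ICMP_TYPE_NAMES.get(icmp_type, "unknown"))
```

With the default of 6168 (0x1818), the bits set are 3, 4, 11 and 12. Type 0 (Echo Reply) is not among them, which may be worth keeping in mind when deciding which of these variables actually affects ping replies.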
Apr 5 2018, 6:44 AM · Patch-For-Review, netops, Operations, Traffic

Apr 4 2018

ema created P6948 lvs1006-mathoid.log.
Apr 4 2018, 11:17 PM
ema created P6947 (An Untitled Masterwork).
Apr 4 2018, 11:00 PM
ema added a comment to T137979: Support brotli compression.

during that timeframe we only received AE:br requests for methods other than GET (OPTIONS, POST).

Apr 4 2018, 4:17 PM · Performance-Team (Radar), Operations, Traffic
ema added a comment to T137979: Support brotli compression.

To get a more up-to-date idea about the percentage of requests we get with AE:br, I've analyzed 30s of GET traffic on cp3033 and was surprised to find zero requests with AE:br. I've then tried to match for non-PURGE, and interestingly during that timeframe we only received AE:br requests for methods other than GET (OPTIONS, POST).
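
A per-method breakdown like the one described above could be computed along these lines. This is only a sketch under stated assumptions: the original comment does not say how the traffic was captured, so the input here is a hypothetical list of (method, Accept-Encoding) pairs, and the function names are made up for illustration.

```python
from collections import Counter


def accepts_brotli(accept_encoding):
    """Return True if an Accept-Encoding value lists 'br' with a non-zero q-value."""
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        if token.strip().lower() != "br":
            continue
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def brotli_share_by_method(requests):
    """Map each HTTP method to the fraction of its requests advertising brotli.

    `requests` is an iterable of (method, accept_encoding) pairs; this input
    shape is an assumption, not the format used in the original analysis.
    """
    totals, with_br = Counter(), Counter()
    for method, accept_encoding in requests:
        totals[method] += 1
        if accepts_brotli(accept_encoding):
            with_br[method] += 1
    return {method: with_br[method] / totals[method] for method in totals}


# Made-up sample data, only to show the shape of the result.
sample = [
    ("GET", "gzip, deflate"),
    ("GET", "gzip"),
    ("POST", "gzip, deflate, br"),
    ("OPTIONS", "br;q=1.0, gzip"),
    ("POST", "gzip"),
]
print(brotli_share_by_method(sample))
```

Splitting the count by method makes a result like "zero for GET, non-zero for OPTIONS and POST" visible directly instead of being hidden in an overall percentage.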

Apr 4 2018, 2:02 PM · Performance-Team (Radar), Operations, Traffic