ema (Emanuele Rocca)
WMF Operations Engineer

User Details

User Since
Sep 29 2015, 8:49 PM (111 w, 3 d)
Availability
Available
IRC Nick
ema
LDAP User
Ema
MediaWiki User
Unknown

Recent Activity

Thu, Nov 16

ema moved T180712: VCL: handling of uncacheable responses in wikimedia-common from Triage to Caching on the Traffic board.
Thu, Nov 16, 5:27 PM · Operations, Traffic
ema triaged T180712: VCL: handling of uncacheable responses in wikimedia-common as Normal priority.
Thu, Nov 16, 5:27 PM · Operations, Traffic
ema created T180712: VCL: handling of uncacheable responses in wikimedia-common.
Thu, Nov 16, 5:26 PM · Operations, Traffic

Wed, Nov 15

ema renamed T180568: Aberrant load on instances involved in recent bootstrap from Abberant load on instances involved in recent bootstrap to Aberrant load on instances involved in recent bootstrap.
Wed, Nov 15, 3:53 PM · Services (doing), User-Eevans, Cassandra, Operations

Tue, Nov 14

ema edited P6104 cache-misc-labs-hiera.yaml.
Tue, Nov 14, 3:28 PM
ema triaged T180257: Puppet / LVS: confusion in service vs IP name as Normal priority.
Tue, Nov 14, 11:13 AM · Operations, Traffic
ema moved T180257: Puppet / LVS: confusion in service vs IP name from Triage to LoadBalancer on the Traffic board.
Tue, Nov 14, 9:41 AM · Operations, Traffic
ema moved T180269: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries from Triage to TLS on the Traffic board.
Tue, Nov 14, 9:41 AM · Traffic, Wikimedia-General-or-Unknown, HTTPS, Operations
ema moved T180407: Change "CP" cookie from subdomain to project level from Triage to Caching on the Traffic board.
Tue, Nov 14, 9:41 AM · Operations, Traffic
ema moved T180424: cp3048 crashed from Triage to Hardware on the Traffic board.
Tue, Nov 14, 9:41 AM · Operations, ops-esams, Traffic
ema moved T180433: Upgrade cache_upload to Varnish 5 from Triage to Caching on the Traffic board.
Tue, Nov 14, 9:41 AM · Performance-Team (Radar), Traffic, Operations
ema moved T180434: Uncacheable content handling: hfp vs hfm from Triage to Caching on the Traffic board.
Tue, Nov 14, 9:41 AM · Patch-For-Review, Operations, Traffic
ema triaged T180434: Uncacheable content handling: hfp vs hfm as Normal priority.
Tue, Nov 14, 9:41 AM · Patch-For-Review, Operations, Traffic
ema created T180434: Uncacheable content handling: hfp vs hfm.
Tue, Nov 14, 9:41 AM · Patch-For-Review, Operations, Traffic
ema created T180433: Upgrade cache_upload to Varnish 5.
Tue, Nov 14, 9:32 AM · Performance-Team (Radar), Traffic, Operations

Mon, Nov 13

ema updated subscribers of T180329: Add CI to all operations/software/varnish/* repositories and archive obsolete ones.

I've updated the task description with comments about all repos. They're all debian packages with the exception of varnishkafka/testing.

Mon, Nov 13, 1:13 PM · Operations, Traffic, Continuous-Integration-Config
ema updated the task description for T180329: Add CI to all operations/software/varnish/* repositories and archive obsolete ones.
Mon, Nov 13, 1:08 PM · Operations, Traffic, Continuous-Integration-Config
ema triaged T180329: Add CI to all operations/software/varnish/* repositories and archive obsolete ones as Normal priority.
Mon, Nov 13, 12:59 PM · Operations, Traffic, Continuous-Integration-Config
ema triaged T180179: Evaluate the possibility to add Juniper images to Openstack as Normal priority.
Mon, Nov 13, 12:58 PM · cloud-services-team (Kanban), Cloud-VPS, netops, Operations, Traffic
ema moved T180329: Add CI to all operations/software/varnish/* repositories and archive obsolete ones from Triage to Caching on the Traffic board.
Mon, Nov 13, 12:54 PM · Operations, Traffic, Continuous-Integration-Config

Fri, Nov 10

ema moved T172459: eqiad row D switch upgrade from General to Network on the Traffic board.
Fri, Nov 10, 4:38 PM · Patch-For-Review, Operations, netops, Traffic
ema moved T180179: Evaluate the possibility to add Juniper images to Openstack from General to Network on the Traffic board.
Fri, Nov 10, 4:37 PM · cloud-services-team (Kanban), Cloud-VPS, netops, Operations, Traffic
ema moved T180179: Evaluate the possibility to add Juniper images to Openstack from Triage to General on the Traffic board.
Fri, Nov 10, 4:37 PM · cloud-services-team (Kanban), Cloud-VPS, netops, Operations, Traffic
ema moved T180178: Request increased quota for traffic Cloud VPS project from Triage to General on the Traffic board.
Fri, Nov 10, 4:36 PM · netops, Traffic, Cloud-VPS (Quota-requests), Operations
ema triaged T180178: Request increased quota for traffic Cloud VPS project as Normal priority.
Fri, Nov 10, 4:36 PM · netops, Traffic, Cloud-VPS (Quota-requests), Operations
ema moved T180256: authdns prometheus metrics are not available anymore from Triage to DNS Infra on the Traffic board.
Fri, Nov 10, 4:35 PM · Patch-For-Review, monitoring, Prometheus-metrics-monitoring, Operations, Traffic
ema triaged T180256: authdns prometheus metrics are not available anymore as Normal priority.
Fri, Nov 10, 4:34 PM · Patch-For-Review, monitoring, Prometheus-metrics-monitoring, Operations, Traffic
ema created T180256: authdns prometheus metrics are not available anymore.
Fri, Nov 10, 4:34 PM · Patch-For-Review, monitoring, Prometheus-metrics-monitoring, Operations, Traffic

Thu, Nov 9

ema triaged T158604: Investigate usefulness of SameSite cookies for logged-in accounts as Normal priority.
Thu, Nov 9, 7:40 AM · Traffic, Operations, Security-Core, MediaWiki-Authentication-and-authorization
ema added a comment to T178567: Server error (500) while trying to download files from Commons from PAWS.

Anything else left to do here? Is the problem solved for you @Chicocvenancio?

Thu, Nov 9, 7:38 AM · Patch-For-Review, media-storage, Operations, Traffic, Pywikibot-Commons, PAWS
ema triaged T178567: Server error (500) while trying to download files from Commons from PAWS as Normal priority.
Thu, Nov 9, 7:35 AM · Patch-For-Review, media-storage, Operations, Traffic, Pywikibot-Commons, PAWS
ema triaged T179026: LVS IPv6 IPs should all be recorded in DNS as Normal priority.
Thu, Nov 9, 7:29 AM · Traffic, Operations
ema triaged T176875: Allow access to wdqs.svc.eqiad.wmnet on port 8888 as Normal priority.
Thu, Nov 9, 7:29 AM · Traffic, Wikidata-Query-Service, Operations, WMDE-Analytics-Engineering, User-Addshore, Wikidata, Discovery
ema removed a project from T178778: Parsoid, VisualEditor not working with SSL / HTTPS: Traffic.
Thu, Nov 9, 7:28 AM · Operations, HTTPS, Parsoid, VisualEditor
ema moved T179953: cp3043 disk failure from Caching to Hardware on the Traffic board.
Thu, Nov 9, 7:22 AM · Traffic, Operations, ops-esams
ema moved T179050: setup bast4002/WMF7218 from Triage to Watching on the Traffic board.
Thu, Nov 9, 7:20 AM · Traffic, Operations, ops-ulsfo
ema moved T179204: setup/deploy dns400[12]/wmf721[56] from Triage to Watching on the Traffic board.
Thu, Nov 9, 7:20 AM · Traffic, Operations, ops-ulsfo
ema moved T177742: Investigate Chrony as a replacement for ISC ntpd from Triage to General on the Traffic board.
Thu, Nov 9, 7:19 AM · Traffic, Operations
ema moved T180069: Pybal should be able to advertise to multiple routers from Triage to LoadBalancer on the Traffic board.
Thu, Nov 9, 7:14 AM · Pybal, Traffic, Operations
ema changed the profile image for blog The Traffic Blog.
Thu, Nov 9, 7:12 AM · Traffic

Wed, Nov 8

ema triaged T180041: Please create a phame blog for the Traffic team as Normal priority.
Wed, Nov 8, 3:19 PM · User-greg, Release-Engineering-Team (Kanban), Operations, Traffic, Phabricator
ema created T180041: Please create a phame blog for the Traffic team.
Wed, Nov 8, 3:19 PM · User-greg, Release-Engineering-Team (Kanban), Operations, Traffic, Phabricator
ema moved T179953: cp3043 disk failure from Triage to Caching on the Traffic board.
Wed, Nov 8, 10:35 AM · Traffic, Operations, ops-esams

Tue, Nov 7

ema moved T179197: Investigate what caused the the unattended varnish upgrade in Beta Cluster from Triage to Caching on the Traffic board.
Tue, Nov 7, 8:07 AM · Release-Engineering-Team (Someday), Traffic, Operations, Beta-Cluster-Infrastructure
ema moved T179156: 503 spikes and resulting API slowness starting 18:45 October 26 from Triage to Caching on the Traffic board.
Tue, Nov 7, 8:07 AM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
ema closed T63782: Add varnish logs to logstash as Resolved.

Done.

Tue, Nov 7, 7:00 AM · Patch-For-Review, Traffic, Operations, Wikimedia-Logstash
ema closed T63782: Add varnish logs to logstash, a subtask of T63779: Add system logs to logstash (tracking), as Resolved.
Tue, Nov 7, 7:00 AM · Tracking, Wikimedia-Logstash

Mon, Oct 30

ema added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

My best hypothesis for the "unreasonable" behavior that would break under do_stream=false is that we have some URI which is abusing HTTP chunked responses to stream an indefinite response. Sort of like websockets, but using the normal HTTP protocol primitives. Client sends a request for "give me a live stream of some events or whatever", and the server periodically sends new HTTP response chunks to the client containing new bits of the event feed. Varnish has no way to distinguish this behavior from normal chunked HTTP (where the response chunks will eventually reach a natural end in a reasonable timeframe), and in the do_stream=false store-and-forward mode, Varnish would consume this chunk stream into its own memory buffers indefinitely, waiting for the stream to end before it can forward the whole thing to the client. This behavior would line up with a lot of the strange stats indicators we've seen in Varnish recently (both during this problem, and at other earlier points in time).

Mon, Oct 30, 10:14 AM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
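
The hypothesis in the comment above can be illustrated with a small simulation. This is only a sketch in plain Python, not Varnish code: the function names, the chunk contents, and the `max_buffered` cap are all assumptions made up for the example. It contrasts a proxy that forwards chunks as they arrive with one that buffers the whole body first, and shows why the second never finishes when the upstream response has no end.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def finite_response(n_chunks: int) -> Iterator[bytes]:
    """A normal chunked response that ends after n_chunks chunks."""
    for i in range(n_chunks):
        yield f"chunk-{i}".encode()


def endless_event_feed() -> Iterator[bytes]:
    """A response that never ends, like a live event feed over chunked HTTP."""
    for i in count():
        yield f"event-{i}".encode()


def stream(upstream: Iterable[bytes]) -> Iterator[bytes]:
    """Streaming mode: pass each chunk to the client as soon as it arrives."""
    for chunk in upstream:
        yield chunk


def store_and_forward(upstream: Iterable[bytes], max_buffered: int) -> List[bytes]:
    """Store-and-forward mode: buffer the whole body before delivering any of it.

    max_buffered is a safety cap added for this demo only, so that the
    endless case terminates instead of consuming memory forever.
    """
    buffered: List[bytes] = []
    for chunk in upstream:
        buffered.append(chunk)
        if len(buffered) > max_buffered:
            raise MemoryError(f"gave up after buffering {len(buffered)} chunks")
    return buffered


if __name__ == "__main__":
    # Finite response: both modes deliver the same chunks.
    print(list(stream(finite_response(3))))
    print(store_and_forward(finite_response(3), max_buffered=1000))

    # Endless response: streaming can hand the client its first chunks at once...
    print(list(islice(stream(endless_event_feed()), 2)))

    # ...while store-and-forward never reaches the end of the body.
    try:
        store_and_forward(endless_event_feed(), max_buffered=1000)
    except MemoryError as exc:
        print("store-and-forward:", exc)
```

In the simulation the proxy cannot tell the two upstreams apart from the chunks alone, which mirrors the point made in the comment: only the absence of an end distinguishes the feed from an ordinary chunked response.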

Sun, Oct 29

ema added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

For future reference by another opsen who might be looking at this: one of the key metrics for identifying what we've been calling the "target cache" in eqiad, the one that will (eventually) have issues due to whatever bad traffic is currently mapped through it, is the connection count to appservers.svc.eqiad.wmnet + api-appservers.svc.eqiad.wmnet across all the eqiad cache nodes.

Sun, Oct 29, 12:33 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
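
As a rough sketch of how the metric described in the comment above could be compared across nodes: the helper below sums the connection counts to the two services named in the comment and returns the node with the highest total. The function names, the input shape, and the node names in the usage note are hypothetical; how the counts are actually collected is not covered here.

```python
from typing import Dict

# The two backend services named in the comment.
SERVICES = ("appservers.svc.eqiad.wmnet", "api-appservers.svc.eqiad.wmnet")


def total_connections(per_service: Dict[str, int]) -> int:
    """Sum one node's connection counts to the services of interest."""
    return sum(per_service.get(service, 0) for service in SERVICES)


def likely_target_cache(counts: Dict[str, Dict[str, int]]) -> str:
    """Return the cache node with the most connections to those services.

    counts maps a node name to its per-service connection counts.
    """
    return max(counts, key=lambda node: total_connections(counts[node]))
```

For example, with made-up input such as `{"cache-a": {"appservers.svc.eqiad.wmnet": 40}, "cache-b": {"appservers.svc.eqiad.wmnet": 900}}`, the helper returns `"cache-b"`. A real check would probably also want a threshold, since the highest count is not necessarily abnormal.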

Fri, Oct 27

ema created P6204 (An Untitled Masterwork).
Fri, Oct 27, 2:36 PM

Thu, Oct 26

ema edited P6190 multi-layer-cl-dostream.vtc.
Thu, Oct 26, 1:59 PM
ema edited P6190 multi-layer-cl-dostream.vtc.
Thu, Oct 26, 1:50 PM
ema created P6193 (An Untitled Masterwork).
Thu, Oct 26, 1:07 PM
ema created P6192 (An Untitled Masterwork).
Thu, Oct 26, 12:57 PM
ema created P6191 (An Untitled Masterwork).
Thu, Oct 26, 12:41 PM
ema created P6190 multi-layer-cl-dostream.vtc.
Thu, Oct 26, 11:25 AM
ema closed T159429: Allow setting varnish connection timeouts in puppet as Resolved.

All varnish runtime parameters can now be specified with the profile::cache::base::be_runtime_params hiera setting.

Thu, Oct 26, 9:01 AM · Patch-For-Review, Operations, Traffic

Wed, Oct 25

ema created P6179 admission-policy.vtc.
Wed, Oct 25, 11:42 AM
ema created P6178 admissionprob.c.
Wed, Oct 25, 10:18 AM
ema created P6177 14-admission-probability.vtc.
Wed, Oct 25, 10:14 AM
ema added a comment to T174960: Varnish does not vary elasticsearch query by request body.

Actually on closer review, kibana is allowing some POST requests to a limited set of endpoints, but not your _search endpoint:

[...]

Wed, Oct 25, 8:14 AM · Operations, Traffic, Wikimedia-Logstash

Tue, Oct 24

ema edited P6171 upload-labs-hiera.yaml.
Tue, Oct 24, 4:15 PM
ema edited P6171 upload-labs-hiera.yaml.
Tue, Oct 24, 3:43 PM
ema created P6171 upload-labs-hiera.yaml.
Tue, Oct 24, 3:38 PM
ema closed T141373: Age header reset to 0 after 24 hours on varnish frontends as Resolved.

Anything left to look at here?

Tue, Oct 24, 9:39 AM · Operations, Traffic
ema awarded T178841: Beta cluster is down a Goat token.
Tue, Oct 24, 9:17 AM · User-Ryasmeen, User-greg, Patch-For-Review, Traffic, Operations, Beta-Cluster-Infrastructure

Mon, Oct 23

ema added a comment to T174960: Varnish does not vary elasticsearch query by request body.

@dbarratt can you please provide some examples, including request/response headers and body, the behavior you're seeing and the one you'd expect?

Mon, Oct 23, 3:18 PM · Operations, Traffic, Wikimedia-Logstash
ema closed T170131: Recurring varnish-be fetch failures in codfw as Resolved.

Nope!

Mon, Oct 23, 3:10 PM · netops, Traffic, Operations

Fri, Oct 20

ema edited P6104 cache-misc-labs-hiera.yaml.
Fri, Oct 20, 8:52 AM
ema edited P6104 cache-misc-labs-hiera.yaml.
Fri, Oct 20, 8:46 AM
ema edited P6104 cache-misc-labs-hiera.yaml.
Fri, Oct 20, 8:42 AM

Oct 18 2017

ema moved T178436: rack/setup/install lvs400[567].ulsfo.wmnet from Triage to LoadBalancer on the Traffic board.
Oct 18 2017, 1:50 PM · Patch-For-Review, Traffic, Operations
ema moved T178423: rack/setup/install cp40(29|3[012]).ulsfo.wmnet from Triage to Caching on the Traffic board.
Oct 18 2017, 1:50 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo
ema created P6148 traffic-clusterssh.py.
Oct 18 2017, 10:28 AM

Oct 17 2017

ema created P6137 (An Untitled Masterwork).
Oct 17 2017, 1:58 PM
ema closed T177233: Upgrade cache_misc to Varnish 5 as Resolved.

Done.

Oct 17 2017, 7:46 AM · Patch-For-Review, Performance-Team (Radar), Traffic, Operations
ema closed T177233: Upgrade cache_misc to Varnish 5, a subtask of T168529: Upgrade to Varnish 5, as Resolved.
Oct 17 2017, 7:46 AM · Patch-For-Review, Performance-Team (Radar), Operations, Traffic

Oct 16 2017

ema closed T178149: RunCommandMonitoringProtocol throws an exception if runcommand.arguments is not specified as Resolved.

PyBal upgraded to 1.14.2 on all LVS hosts.

Oct 16 2017, 12:46 PM · Patch-For-Review, Traffic, Operations, Pybal
ema closed T177815: Alerts on LVS services with one single realserver as Resolved.

PyBal upgraded to 1.14.2 on all LVS hosts.

Oct 16 2017, 12:46 PM · Patch-For-Review, Operations, Pybal, Traffic

Oct 13 2017

ema moved T177927: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls from Triage to Watching on the Traffic board.
Oct 13 2017, 2:28 PM · Traffic, Operations, User-Elukey, Analytics
ema moved T177961: Upgrade LVS servers to stretch from Triage to LoadBalancer on the Traffic board.
Oct 13 2017, 2:28 PM · Patch-For-Review, Traffic, Pybal, Operations
ema moved T178011: cp4026 memory error from Triage to Caching on the Traffic board.
Oct 13 2017, 2:28 PM · Traffic, Operations, ops-ulsfo
ema moved T178078: RESTBase logs disappeared from logstash from Triage to LoadBalancer on the Traffic board.
Oct 13 2017, 2:28 PM · Patch-For-Review, Traffic, Operations, Wikimedia-Logstash, Services (watching)
ema moved T178149: RunCommandMonitoringProtocol throws an exception if runcommand.arguments is not specified from Triage to LoadBalancer on the Traffic board.
Oct 13 2017, 2:28 PM · Patch-For-Review, Traffic, Operations, Pybal
ema moved T178151: Add UDP monitor for pybal from Triage to LoadBalancer on the Traffic board.
Oct 13 2017, 2:28 PM · Operations, Traffic, Pybal
ema closed T133791: check_dns needs to be rewritten as Resolved.

check_dns v1.5 (nagios-plugins 1.5) seems to be doing the right thing currently:

Oct 13 2017, 2:26 PM · Traffic, Cloud-Services, Operations
ema triaged T178151: Add UDP monitor for pybal as Normal priority.
Oct 13 2017, 10:15 AM · Operations, Traffic, Pybal
ema triaged T178149: RunCommandMonitoringProtocol throws an exception if runcommand.arguments is not specified as Normal priority.
Oct 13 2017, 9:58 AM · Patch-For-Review, Traffic, Operations, Pybal
ema created T178149: RunCommandMonitoringProtocol throws an exception if runcommand.arguments is not specified.
Oct 13 2017, 9:57 AM · Patch-For-Review, Traffic, Operations, Pybal
ema triaged T177228: Multiple systems in esams OE10 showing PSU failures as Normal priority.
Oct 13 2017, 7:09 AM · Traffic, ops-esams, DC-Ops, Operations

Oct 12 2017

ema created P6109 (An Untitled Masterwork).
Oct 12 2017, 11:49 AM

Oct 11 2017

ema triaged T177961: Upgrade LVS servers to stretch as Normal priority.
Oct 11 2017, 4:09 PM · Patch-For-Review, Traffic, Pybal, Operations
ema created T177961: Upgrade LVS servers to stretch.
Oct 11 2017, 4:08 PM · Patch-For-Review, Traffic, Pybal, Operations
ema moved T177815: Alerts on LVS services with one single realserver from Backlog to In Progress on the Pybal board.
Oct 11 2017, 3:03 PM · Patch-For-Review, Operations, Pybal, Traffic
ema created P6104 cache-misc-labs-hiera.yaml.
Oct 11 2017, 2:08 PM
ema triaged T177927: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls as Normal priority.
Oct 11 2017, 1:25 PM · Traffic, Operations, User-Elukey, Analytics

Oct 10 2017

ema moved T177815: Alerts on LVS services with one single realserver from Triage to LoadBalancer on the Traffic board.
Oct 10 2017, 6:10 AM · Patch-For-Review, Operations, Pybal, Traffic
ema triaged T177815: Alerts on LVS services with one single realserver as Normal priority.
Oct 10 2017, 6:09 AM · Patch-For-Review, Operations, Pybal, Traffic
ema created T177815: Alerts on LVS services with one single realserver.
Oct 10 2017, 6:09 AM · Patch-For-Review, Operations, Pybal, Traffic
ema moved T177199: Add Prometheus client support for varnish/statsd metrics daemons from Triage to Caching on the Traffic board.
Oct 10 2017, 5:37 AM · Patch-For-Review, Traffic, User-fgiunchedi, Goal, Operations