ema (Emanuele Rocca)
WMF Operations Engineer

User Details

User Since
Sep 29 2015, 8:49 PM (103 w, 21 h)
Availability
Available
IRC Nick
ema
LDAP User
Ema
MediaWiki User
Unknown

Recent Activity

Tue, Sep 19

ema created P6023 pybal-1.14.1-changelog.
Tue, Sep 19, 2:14 PM

Mon, Sep 18

ema awarded T175636: prometheus -> grafana stats for per-numa-node meminfo a Goat token.
Mon, Sep 18, 3:38 PM · Patch-For-Review, monitoring, Traffic, Operations
ema moved T174640: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie from Triage to Caching on the Traffic board.
Mon, Sep 18, 8:34 AM · Traffic, Operations, Analytics
ema moved T175803: Text eqiad varnish 503 spikes from Triage to Caching on the Traffic board.
Mon, Sep 18, 8:33 AM · Patch-For-Review, Traffic, Operations

Tue, Sep 12

ema added a comment to T164768: Explicitly limit varnishd transient storage.

We're still missing caps for the upload cluster, right? (well and misc, but that case isn't all that important here).

Tue, Sep 12, 2:26 PM · Patch-For-Review, Traffic, Operations
ema moved T175636: prometheus -> grafana stats for per-numa-node meminfo from Triage to General on the Traffic board.
Tue, Sep 12, 8:42 AM · Patch-For-Review, monitoring, Traffic, Operations
ema triaged T175636: prometheus -> grafana stats for per-numa-node meminfo as Normal priority.
Tue, Sep 12, 8:42 AM · Patch-For-Review, monitoring, Traffic, Operations

Mon, Sep 11

ema moved T175319: cp1066 unexplained 503 spikes from Triage to Caching on the Traffic board.
Mon, Sep 11, 3:05 PM · Operations, Traffic

Fri, Sep 8

Dzahn awarded T169600: Enable diamond PowerDNSRecursor collector on dnsrecursors a Love token.
Fri, Sep 8, 4:03 PM · Patch-For-Review, Diamond, monitoring, Traffic, Prometheus-metrics-monitoring, Operations

Thu, Sep 7

ema added a comment to T171710: pybal: add prometheus metrics.

I know a bunch of work happened during the Wikimania hackathon, but what's the status of this?

Thu, Sep 7, 1:02 PM · Patch-For-Review, monitoring, Pybal, Operations, Traffic
ema moved T174932: Recurrent 'mailbox lag' critical alerts and 500s from Triage to Caching on the Traffic board.
Thu, Sep 7, 12:29 PM · Patch-For-Review, Operations, Traffic
ema moved T174960: Varnish does not vary elasticsearch query by request body from Triage to Caching on the Traffic board.
Thu, Sep 7, 12:29 PM · Operations, Traffic, Wikimedia-Logstash
ema moved T175203: Implement stateless TCP balancing in our LVS servers from Triage to LoadBalancer on the Traffic board.
Thu, Sep 7, 12:28 PM · Operations, Pybal, Traffic

Tue, Sep 5

ema added a comment to T174891: cp4024 kernel errors.

On September 1st:

Tue, Sep 5, 9:49 AM · ops-ulsfo, Operations, Traffic

Mon, Sep 4

ema created T174959: swift-recon-cron on ms-be203[34]: [Errno 17] File exists: '/var/lock/swift-recon-object-cron'.
Mon, Sep 4, 4:09 PM · User-fgiunchedi, media-storage, Operations
ema added a comment to T174891: cp4024 kernel errors.

Thanks @elukey! Yeah cp4024 might be having hardware issues. The system was down yesterday at 9ish AM UTC. I've power-cycled it and it came back online fine, but then after some hours it started with the lockups mentioned in this task description.

Mon, Sep 4, 7:46 AM · ops-ulsfo, Operations, Traffic
ema moved T174891: cp4024 kernel errors from Triage to Caching on the Traffic board.
Mon, Sep 4, 7:43 AM · ops-ulsfo, Operations, Traffic
ema triaged T174891: cp4024 kernel errors as Normal priority.
Mon, Sep 4, 7:43 AM · ops-ulsfo, Operations, Traffic

Fri, Sep 1

ema closed T163233: Implement Varnish-level rough ratelimiting as Resolved.

We've been using vsthrottle in prod for a while now, closing.

Fri, Sep 1, 9:27 AM · Analytics, Patch-For-Review, Operations, Traffic
ema closed T163233: Implement Varnish-level rough ratelimiting, a subtask of T169175: What is a reasonable per-IP ratelimit for maps, as Resolved.
Fri, Sep 1, 9:27 AM · Patch-For-Review, Discovery-Analysis, Operations, Traffic, Maps-Sprint, Maps, Discovery

Thu, Aug 31

ema created P5950 (An Untitled Masterwork).
Thu, Aug 31, 3:09 PM

Wed, Aug 30

ema added a comment to T174432: Unclear LVS bandwidth graph in "load balancers" dashboard.

Are the non-icmp graphs somehow LVS-specific?

Wed, Aug 30, 9:49 AM · Traffic, Operations
ema moved T174432: Unclear LVS bandwidth graph in "load balancers" dashboard from Triage to LoadBalancer on the Traffic board.
Wed, Aug 30, 9:20 AM · Traffic, Operations
ema triaged T174432: Unclear LVS bandwidth graph in "load balancers" dashboard as Normal priority.
Wed, Aug 30, 9:20 AM · Traffic, Operations

Tue, Aug 29

ema moved T172459: eqiad row D switch upgrade from Triage to General on the Traffic board.
Tue, Aug 29, 8:36 AM · Patch-For-Review, Operations, netops, Traffic
ema closed T171028: Degraded RAID on cp1008 as Resolved.

Looks good, thanks @Cmjohnson!

Tue, Aug 29, 7:06 AM · Traffic, Operations

Aug 11 2017

ema added a comment to T82849: lvs servers report 'Memory allocation problem' on bootup.
In T82849#3503596, @ema wrote:

A more general patch has been submitted by Julian Anastasov http://archive.linuxvirtualserver.org/html/lvs-devel/2017-08/msg00001.html \o/

Aug 11 2017, 8:19 PM · Traffic, Pybal, Operations

Aug 9 2017

ema updated the task description for T171710: pybal: add prometheus metrics.
Aug 9 2017, 9:26 PM · Patch-For-Review, monitoring, Pybal, Operations, Traffic

Aug 5 2017

ema added a comment to T82849: lvs servers report 'Memory allocation problem' on bootup.

A more general patch has been submitted by Julian Anastasov http://archive.linuxvirtualserver.org/html/lvs-devel/2017-08/msg00001.html \o/

Aug 5 2017, 3:05 PM · Traffic, Pybal, Operations

Aug 2 2017

ema added a comment to T171028: Degraded RAID on cp1008.

@Cmjohnson any news?

Aug 2 2017, 9:31 AM · Traffic, Operations

Aug 1 2017

ema moved T170843: Determine where to host zim files for the Android app from Triage to Caching on the Traffic board.
Aug 1 2017, 12:45 PM · Operations, Traffic, Reading-Infrastructure-Team-Backlog (Kanban), Wikipedia-Android-App-Backlog, Android-app-feature-Compilations
ema moved T99531: [Task] move wikiba.se webhosting to wikimedia misc-cluster from Triage to Caching on the Traffic board.
Aug 1 2017, 12:45 PM · Traffic, wikiba.se, Operations, Wikidata-Sprint-2016-11-08, Wikidata
ema moved T172148: Determine URL paths for Zim files from Triage to Caching on the Traffic board.
Aug 1 2017, 12:45 PM · Reading-Infrastructure-Team-Backlog (Kanban), Operations, Traffic, Wikipedia-Android-App-Backlog, Android-app-feature-Compilations
ema moved T172123: Determine how to upload Zim files to Swift infrastructure from Triage to Caching on the Traffic board.
Aug 1 2017, 12:44 PM · Patch-For-Review, Reading-Infrastructure-Team-Backlog, Operations, Traffic, Wikipedia-Android-App-Backlog, Android-app-feature-Compilations
ema moved T172116: Improve OCSP fetching and monitoring strategies from Triage to TLS on the Traffic board.
Aug 1 2017, 12:44 PM · Patch-For-Review, Operations, Traffic
ema moved T172124: PyBal Feature: progressive depooling strategy for monitored failures from Triage to LoadBalancer on the Traffic board.
Aug 1 2017, 12:44 PM · Pybal, Traffic, Operations
ema moved T172103: IPVS issues with UDP services, pybal depooling strategy from Triage to LoadBalancer on the Traffic board.
Aug 1 2017, 12:44 PM · Pybal, Traffic, Operations
ema added a comment to T82849: lvs servers report 'Memory allocation problem' on bootup.

I've sent a patch upstream covering the virtual service removal case: http://archive.linuxvirtualserver.org/html/lvs-devel/2017-07/msg00016.html

Aug 1 2017, 6:39 AM · Traffic, Pybal, Operations

Jul 31 2017

ema added a comment to T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs.

That it's happening often enough to really notice, and that most of the servers listed are low-traffic (which has tons of services), makes me think it's indicative of deeper issues, too. If PyBal can't answer these queries in under a second, we may be reaching some scaling limits in its design (the number of concurrent events happening in a single thread in twisted backlogging on cpu+waste?).

Jul 31 2017, 3:46 PM · Patch-For-Review, Operations-Software-Development, Pybal, Traffic, Operations
ema triaged T172103: IPVS issues with UDP services, pybal depooling strategy as Normal priority.
Jul 31 2017, 11:56 AM · Pybal, Traffic, Operations
ema created T172103: IPVS issues with UDP services, pybal depooling strategy.
Jul 31 2017, 11:56 AM · Pybal, Traffic, Operations
ema moved T172101: OCSP update failed for /etc/update-ocsp.d/globalsign-2016-ecdsa-unified.conf from Triage to TLS on the Traffic board.
Jul 31 2017, 11:18 AM · Operations, Traffic
ema triaged T172101: OCSP update failed for /etc/update-ocsp.d/globalsign-2016-ecdsa-unified.conf as Normal priority.
Jul 31 2017, 11:18 AM · Operations, Traffic
ema created T172101: OCSP update failed for /etc/update-ocsp.d/globalsign-2016-ecdsa-unified.conf.
Jul 31 2017, 11:18 AM · Operations, Traffic
ema moved T171967: setup/install cp4022 from Triage to Caching on the Traffic board.
Jul 31 2017, 9:20 AM · Patch-For-Review, Traffic, Operations
ema moved T171966: setup/install cp402[34] from Triage to Caching on the Traffic board.
Jul 31 2017, 9:20 AM · Patch-For-Review, Traffic, Operations
ema moved T171850: Backport ipvsadm from Triage to LoadBalancer on the Traffic board.
Jul 31 2017, 9:20 AM · Pybal, Traffic, Operations

Jul 28 2017

ema created P5819 ipvsadm-D-memory-error.diff.
Jul 28 2017, 4:50 PM
ema raised the priority of T86650: Add support for setting weight=0 when depooling from Low to High.
Jul 28 2017, 4:19 PM · Operations, Traffic, Patch-For-Review, Pybal
ema added a comment to T171421: 503 error for certain JPG thumbnail: "Backend fetch failed".

@Aklapper _usually_ traffic since this indicates varnish failure to fetch and most likely a network or varnish problem. See also https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches

What is the Y axis on that graph, incidents per unit time?

Jul 28 2017, 12:56 PM · Operations, Traffic, media-storage, Commons
ema closed T171421: 503 error for certain JPG thumbnail: "Backend fetch failed" as Resolved.

We do have occasional backend fetch failures. Closing, as this looks like a transient error.

Jul 28 2017, 8:09 AM · Operations, Traffic, media-storage, Commons

Jul 27 2017

ema added a comment to T171850: Backport ipvsadm.

+1 on the stretch upgrade, but I don't think it's very useful to keep the ticket open for future kernel updates; it'll only bitrot, and who knows if there's even a new ipvsadm release when we upgrade to 4.14 next year.

Jul 27 2017, 4:44 PM · Pybal, Traffic, Operations
ema added a comment to T171850: Backport ipvsadm.

+1 on upgrading to stretch. However, we are probably gonna end up in a similar situation on stretch whenever upgrading to newer kernels, so perhaps it might still make sense to keep this ticket open to track ipvsadm backporting efforts whenever necessary?

Jul 27 2017, 2:28 PM · Pybal, Traffic, Operations
ema triaged T171850: Backport ipvsadm as Normal priority.
Jul 27 2017, 2:10 PM · Pybal, Traffic, Operations
ema created T171850: Backport ipvsadm.
Jul 27 2017, 2:10 PM · Pybal, Traffic, Operations
ema moved T154227: URLs with title query string parameter and additional query string parameters do not redirect to mobile site from Triage to Caching on the Traffic board.
Jul 27 2017, 10:41 AM · Unplanned-Sprint-Work, Readers-Web-Kanban-Board, Patch-For-Review, Traffic, Operations, Readers-Web-Backlog (Tracking), Puppet, Need-volunteer, Mobile

Jul 26 2017

ema added projects to T169600: Enable diamond PowerDNSRecursor collector on dnsrecursors: monitoring, Diamond.
Jul 26 2017, 1:14 PM · Patch-For-Review, Diamond, monitoring, Traffic, Prometheus-metrics-monitoring, Operations
ema moved T171710: pybal: add prometheus metrics from Triage to LoadBalancer on the Traffic board.
Jul 26 2017, 8:31 AM · Patch-For-Review, monitoring, Pybal, Operations, Traffic
ema triaged T171710: pybal: add prometheus metrics as Normal priority.
Jul 26 2017, 8:28 AM · Patch-For-Review, monitoring, Pybal, Operations, Traffic
ema created T171710: pybal: add prometheus metrics.
Jul 26 2017, 8:27 AM · Patch-For-Review, monitoring, Pybal, Operations, Traffic

Jul 25 2017

ema added a comment to T171028: Degraded RAID on cp1008.

@ema is it okay to take this down? Most of the time the server needs a re-install after swapping /dev/sda; will this be okay?

Jul 25 2017, 2:54 PM · Traffic, Operations
ema moved T171421: 503 error for certain JPG thumbnail: "Backend fetch failed" from Triage to Caching on the Traffic board.
Jul 25 2017, 9:31 AM · Operations, Traffic, media-storage, Commons
ema edited P5787 (An Untitled Masterwork).
Jul 25 2017, 9:25 AM

Jul 24 2017

ema closed T164579: Investigate nginx reload behavior as Resolved.

Closing, the problem is known and there's no perfect solution (but one nginx reload a day is much better than one every hour!).

Jul 24 2017, 2:17 PM · Patch-For-Review, Traffic, Operations
ema moved T171318: logster should not resolve statsd's IP every time it sends a metric from Watching to Caching on the Traffic board.
Jul 24 2017, 2:13 PM · Patch-For-Review, User-Elukey, Traffic, Operations
ema moved T171318: logster should not resolve statsd's IP every time it sends a metric from Triage to Watching on the Traffic board.
Jul 24 2017, 2:13 PM · Patch-For-Review, User-Elukey, Traffic, Operations
ema moved T171470: Monitor DNS delegations from Triage to DNS Infra on the Traffic board.
Jul 24 2017, 2:11 PM · DNS, Operations, Traffic
ema created P5787 (An Untitled Masterwork).
Jul 24 2017, 7:27 AM

Jul 23 2017

ema edited P5786 errorpage.html.
Jul 23 2017, 12:21 PM
ema created P5786 errorpage.html.
Jul 23 2017, 12:03 PM

Jul 22 2017

ema created P5784 (An Untitled Masterwork).
Jul 22 2017, 9:47 AM

Jul 21 2017

ema triaged T171318: logster should not resolve statsd's IP every time it sends a metric as Normal priority.
Jul 21 2017, 3:36 PM · Patch-For-Review, User-Elukey, Traffic, Operations
ema created T171318: logster should not resolve statsd's IP every time it sends a metric.
Jul 21 2017, 3:35 PM · Patch-For-Review, User-Elukey, Traffic, Operations
ema closed T151643: python-varnishapi daemons seeing "Log overrun" constantly as Resolved.

The last overrun was logged about half an hour ago.

Jul 21 2017, 3:29 PM · Patch-For-Review, Operations, Traffic
ema added a comment to T164768: Explicitly limit varnishd transient storage.

As of yesterday, varnish 4.1.7-1wm1 is deployed on all cache hosts. It includes our patch adding two counters, one for shortlived objects creation and another for uncacheable objects. I've added both counters to the varnish-transient-storage-usage dashboard.

Jul 21 2017, 12:11 PM · Patch-For-Review, Traffic, Operations
ema moved T117826: TEST: redirect small portion of unauthenticated desktop users to mobile web from Triage to Caching on the Traffic board.
Jul 21 2017, 11:31 AM · Traffic, Reading-Community-Engagement, Operations, Reading-Admin
ema moved T170605: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION from Triage to Watching on the Traffic board.
Jul 21 2017, 11:30 AM · Thumbor, media-storage, Traffic, Operations, Commons
ema moved T166782: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there from General to Caching on the Traffic board.
Jul 21 2017, 11:28 AM · Traffic, Operations, Wikimedia-General-or-Unknown, I18n
ema moved T166782: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there from Triage to General on the Traffic board.
Jul 21 2017, 11:27 AM · Traffic, Operations, Wikimedia-General-or-Unknown, I18n
ema moved T171168: cp1050 apparently stuck while "Initializing firmware interfaces..." from Triage to Caching on the Traffic board.
Jul 21 2017, 11:27 AM · Operations, Traffic, ops-eqiad

Jul 20 2017

ema added a comment to T104442: Investigate better DNS cache/lookup solutions.

Forwarding-only caching resolvers would help with issues such as T171048 and T151643.

Jul 20 2017, 2:05 PM · Patch-For-Review, Traffic, Operations
ema updated subscribers of T171028: Degraded RAID on cp1008.

@Cmjohnson please replace the disk (sda) whenever you've got the chance!

Jul 20 2017, 1:56 PM · Traffic, Operations
ema triaged T171168: cp1050 apparently stuck while "Initializing firmware interfaces..." as Normal priority.
Jul 20 2017, 1:56 PM · Operations, Traffic, ops-eqiad
ema created T171168: cp1050 apparently stuck while "Initializing firmware interfaces...".
Jul 20 2017, 1:55 PM · Operations, Traffic, ops-eqiad
ema added a comment to T162612: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9.

So Kconfig says that PERF_EVENTS_INTEL_CSTATE is about perf events for power monitoring. As I don't think we need it, we could blacklist it together with intel-rapl-perf (PERF_EVENTS_INTEL_RAPL) while we're at it.

Jul 20 2017, 11:52 AM · Patch-For-Review, Operations
ema added a comment to T162612: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9.

Reopening, I've just seen this happening again on cp1066. This is what systemd-analyze blame reported after a slow but successful boot:

Jul 20 2017, 11:43 AM · Patch-For-Review, Operations
ema reopened T162612: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 as "Open".
Jul 20 2017, 11:41 AM · Patch-For-Review, Operations
ema reopened T162612: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9, a subtask of T162029: Migrate all jessie hosts to Linux 4.9, as Open.
Jul 20 2017, 11:41 AM · Operations
ema closed T171145: cp3048 down, mgmt console not reachable as Resolved.

So as @MoritzMuehlenhoff mentioned on IRC the mgmt issues might have been due to T171041.

Jul 20 2017, 8:58 AM · Operations, Traffic
ema moved T120631: Security: Is it safe to enable Zero spoofing from Triage to Caching on the Traffic board.
Jul 20 2017, 8:47 AM · Traffic, Zero, Operations
ema moved T171028: Degraded RAID on cp1008 from Triage to Caching on the Traffic board.
Jul 20 2017, 8:46 AM · Traffic, Operations
ema moved T171032: Investigate lvs IP pages during codfw row C switch upgrade from Triage to LoadBalancer on the Traffic board.
Jul 20 2017, 8:46 AM · Operations, Traffic, netops
ema moved T171145: cp3048 down, mgmt console not reachable from Triage to Caching on the Traffic board.
Jul 20 2017, 8:46 AM · Operations, Traffic

Jul 19 2017

ema added a comment to T151643: python-varnishapi daemons seeing "Log overrun" constantly.

So it looks like the varnishstatsd overruns occur mostly in ulsfo:

Jul 19 2017, 7:14 PM · Patch-For-Review, Operations, Traffic
ema added a comment to T148976: Strongswan Icinga check: do not report issues about depooled hosts.

Icinga external commands include SCHEDULE_SVC_DOWNTIME, which seems handy. We could perhaps try writing a script that issues a SCHEDULE_SVC_DOWNTIME for the IPSec service for each host defined in the role::ipsec targets array?

Jul 19 2017, 1:22 PM · Operations, Traffic
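
The script idea in the T148976 comment above could be sketched roughly as follows. This is only an illustration, not what was deployed: the function names, the "IPsec" service description, and the host name in the usage note are hypothetical, and the list of hosts is assumed to come from the role::ipsec targets array by some means not shown here. The sketch only formats external command lines; writing them to the Icinga command file is left out because that path is deployment-specific.

```python
import time


def schedule_svc_downtime(host, service, start, duration_s, author, comment):
    """Format one SCHEDULE_SVC_DOWNTIME external command line.

    Field order follows the Icinga/Nagios external command format:
    host;service;start;end;fixed;trigger_id;duration;author;comment
    """
    end = start + duration_s
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(start),
        str(end),
        "1",  # fixed downtime: starts and ends at the given times
        "0",  # trigger_id 0: not triggered by another downtime
        str(duration_s),
        author,
        comment,
    ]
    return "[{}] {}".format(start, ";".join(fields))


def downtime_commands(hosts, service, duration_s, author, comment, now=None):
    """Build one command line per host (hosts is a hypothetical input list)."""
    start = int(time.time()) if now is None else int(now)
    return [
        schedule_svc_downtime(h, service, start, duration_s, author, comment)
        for h in hosts
    ]
```

For example, downtime_commands(["cp1008.eqiad.wmnet"], "IPsec", 3600, "ema", "depooled") returns a single line that could then be appended to the Icinga external command file.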
ema added a comment to T151643: python-varnishapi daemons seeing "Log overrun" constantly.

Incidentally, while looking at entirely different stuff on esams recdns hosts, I've noticed that the vast majority of our DNS traffic there is due to cache hosts continuously asking for the A record of statsd.eqiad.wmnet.

Jul 19 2017, 12:55 PM · Patch-For-Review, Operations, Traffic
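
One way to avoid a DNS lookup per metric, as described in the comment above, is to resolve the statsd host once and reuse the address until a TTL expires. The sketch below is a generic illustration and not the actual logster or varnishstatsd code: the CachedResolver class, the send_metric helper, and the 300-second default TTL are all assumptions made for the example.

```python
import socket
import time


class CachedResolver:
    """Resolve a hostname once and reuse the result until the TTL expires."""

    def __init__(self, host, port, ttl=300.0,
                 resolve=socket.gethostbyname, clock=time.monotonic):
        self._host = host
        self._port = port
        self._ttl = ttl
        self._resolve = resolve  # injectable so the lookup can be replaced in tests
        self._clock = clock
        self._addr = None
        self._resolved_at = 0.0

    def address(self):
        """Return (ip, port), hitting DNS only when the cached entry is stale."""
        now = self._clock()
        if self._addr is None or now - self._resolved_at >= self._ttl:
            self._addr = (self._resolve(self._host), self._port)
            self._resolved_at = now
        return self._addr


def send_metric(sock, resolver, payload):
    """Send one UDP datagram to the cached address (payload is bytes)."""
    sock.sendto(payload, resolver.address())
```

With this shape, a long-running daemon creates one CachedResolver for the statsd host at startup and calls send_metric for each datapoint, so the recursor only sees a query when the cached entry expires.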
ema added a comment to T151643: python-varnishapi daemons seeing "Log overrun" constantly.

Yeah, all other daemons have been fixed but varnishstatsd seems to still be affected by this issue.

Jul 19 2017, 12:21 PM · Patch-For-Review, Operations, Traffic
ema created P5761 (An Untitled Masterwork).
Jul 19 2017, 9:32 AM
ema added a project to T171028: Degraded RAID on cp1008: Traffic.
Jul 19 2017, 8:05 AM · Traffic, Operations
ema triaged T171028: Degraded RAID on cp1008 as Normal priority.
Jul 19 2017, 8:05 AM · Traffic, Operations