ema (Emanuele Rocca)
Senior Site Reliability Engineer, Traffic Team · Administrator

User Details

User Since: Sep 29 2015, 8:49 PM (248 w, 6 d)
Roles: Administrator
Availability: Available
IRC Nick: ema
LDAP User: Ema
MediaWiki User: Unknown

Recent Activity

Today

ema updated the task description for T255015: Varnish and ATS health-check improvements.
Tue, Jul 7, 12:15 PM · Patch-For-Review, Operations, Traffic

Yesterday

ema moved T255748: Netbox DNS change not effective in gdns from Triage to DNS Infra on the Traffic board.
Mon, Jul 6, 12:07 PM · DNS, Traffic, netbox, Operations

Fri, Jul 3

ema closed T256444: several purgeds badly backlogged (> 10 days) as Resolved.
Fri, Jul 3, 8:44 AM · Patch-For-Review, User-notice, Operations, Traffic
ema added a comment to T256444: several purgeds badly backlogged (> 10 days).

This was the last occurrence of the issue, and no other host has been affected since the librdkafka upgrade yesterday at 2020-07-01T11:50. Let's observe things till tomorrow and then I think we can close this.

Fri, Jul 3, 8:44 AM · Patch-For-Review, User-notice, Operations, Traffic

Thu, Jul 2

ema closed T253555: Remove ganglia leftovers from ops/puppet as Resolved.
Thu, Jul 2, 12:37 PM · Patch-For-Review, Analytics, Traffic, Operations
ema updated the task description for T253555: Remove ganglia leftovers from ops/puppet.
Thu, Jul 2, 12:37 PM · Patch-For-Review, Analytics, Traffic, Operations
ema created T256963: gerrit plugin error: self.onAction is not a function.
Thu, Jul 2, 11:57 AM · Gerrit
ema added a comment to T256446: monitoring & alerting for purged.

@CDanis all done except for rdkafka_consumer_topics_partitions_consumer_lag, there's silence on grafana.wikimedia.org/explore when looking for that metric, even going back one month. Let me know if you think, for the scope of this ticket, that event-lag and local-backlog are enough.

Thu, Jul 2, 11:37 AM · Sustainability (Incident Prevention), Operations, Traffic
ema updated the task description for T256446: monitoring & alerting for purged.
Thu, Jul 2, 11:31 AM · Sustainability (Incident Prevention), Operations, Traffic
ema moved T256217: ETAG response headers not always with double-quotes from Triage to Caching on the Traffic board.
Thu, Jul 2, 11:21 AM · Traffic, Operations, serviceops, affects-Kiwix-and-openZIM
ema moved T256302: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error from Triage to Caching on the Traffic board.
Thu, Jul 2, 9:26 AM · Operations, Traffic
ema added a comment to T256302: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error.

Interesting, I'd be inclined to think that the issue here cannot be simply the user agent or we would know. :-)

Thu, Jul 2, 9:26 AM · Operations, Traffic
ema triaged T256302: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error as Medium priority.
Thu, Jul 2, 9:22 AM · Operations, Traffic
ema moved T256447: Special:HideBanners is not really cacheable from Triage to Watching on the Traffic board.
Thu, Jul 2, 9:21 AM · Performance-Team (Radar), Spike, Operations, Traffic, Varnish, MediaWiki-extensions-CentralNotice, Fundraising-Backlog
ema added a comment to T253555: Remove ganglia leftovers from ops/puppet.

@fgiunchedi: the puppetmaster module still has some ganglia-related things such as prometheus-ganglia-gen. Is that still needed?

Thu, Jul 2, 9:19 AM · Patch-For-Review, Analytics, Traffic, Operations
ema updated the task description for T253555: Remove ganglia leftovers from ops/puppet.
Thu, Jul 2, 9:17 AM · Patch-For-Review, Analytics, Traffic, Operations
ema moved T254568: Accessing Phabricator from Tor from Triage to Watching on the Traffic board.
Thu, Jul 2, 9:16 AM · Operations, Traffic, Phabricator
ema added a comment to T256313: Cached thumbnails and originals are sometimes not being purged correctly/quickly.

The cause is most probably T256444. I'm saying this based on two important pieces of information submitted in this bug report by @AntiCompositeNumber (thanks!)

Thu, Jul 2, 9:12 AM · SRE-swift-storage, Commons, Operations, MediaWiki-File-management, Traffic
ema moved T256313: Cached thumbnails and originals are sometimes not being purged correctly/quickly from Triage to Caching on the Traffic board.
Thu, Jul 2, 8:56 AM · SRE-swift-storage, Commons, Operations, MediaWiki-File-management, Traffic
ema triaged T256313: Cached thumbnails and originals are sometimes not being purged correctly/quickly as Medium priority.
Thu, Jul 2, 8:56 AM · SRE-swift-storage, Commons, Operations, MediaWiki-File-management, Traffic
ema added a comment to T256444: several purgeds badly backlogged (> 10 days).

Mentioned in SAL (#wikimedia-operations) [2020-06-30T16:57:30Z] <cdanis> T256444 restarted purged on cp2030 and repooling

Thu, Jul 2, 8:52 AM · Patch-For-Review, User-notice, Operations, Traffic

Wed, Jul 1

ema added a project to T256863: restbase2009 down: RESTBase.
Wed, Jul 1, 11:01 AM · RESTBase, Operations, ops-codfw
ema created T256863: restbase2009 down.
Wed, Jul 1, 11:00 AM · RESTBase, Operations, ops-codfw
ema added a comment to T256444: several purgeds badly backlogged (> 10 days).

Mentioned in SAL (#wikimedia-operations) [2020-06-30T10:41:19Z] <ema> cp2040: restart purged and varnishkafka to use updated librdkafka1 T256444

Wed, Jul 1, 7:43 AM · Patch-For-Review, User-notice, Operations, Traffic
ema closed T256479: purged crashes with "fatal error: concurrent map read and map write" as Resolved.

Fixed by deploying https://gerrit.wikimedia.org/r/c/operations/software/purged/+/608045; the issue hasn't occurred on any host in the past 36 hours. Closing!

Wed, Jul 1, 7:33 AM · Operations, Traffic
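
For readers unfamiliar with the crash message in T256479: the Go runtime aborts with "fatal error: concurrent map read and map write" when one goroutine reads a plain map while another writes to it. The sketch below shows one common way to avoid that, by guarding the map with a sync.RWMutex. It is a generic illustration only; the type and method names (SafeCounter, Inc, Get) are hypothetical, and nothing here is taken from the actual purged change referenced above.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeCounter is a hypothetical example type: a map guarded by an RWMutex
// so that concurrent readers and writers never touch the map unprotected.
type SafeCounter struct {
	mu sync.RWMutex
	m  map[string]int
}

// NewSafeCounter returns an empty, ready-to-use counter.
func NewSafeCounter() *SafeCounter {
	return &SafeCounter{m: make(map[string]int)}
}

// Inc takes the write lock before mutating the map.
func (c *SafeCounter) Inc(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key]++
}

// Get takes the read lock, so many readers may proceed in parallel
// while writers are excluded.
func (c *SafeCounter) Get(key string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.m[key]
}

func main() {
	c := NewSafeCounter()
	var wg sync.WaitGroup
	// 100 writers and 100 readers run concurrently; with a bare map this
	// pattern could trigger the fatal error quoted in the task title.
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); c.Inc("purges") }()
		go func() { defer wg.Done(); _ = c.Get("purges") }()
	}
	wg.Wait()
	fmt.Println(c.Get("purges")) // prints 100
}
```

Running a program like this with the race detector (go run -race) is a quick way to confirm that no unguarded access remains.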

Tue, Jun 30

ema updated the task description for T256446: monitoring & alerting for purged.
Tue, Jun 30, 8:44 AM · Sustainability (Incident Prevention), Operations, Traffic
ema edited P11703 librdkafka_0.11.6-1.1wmf1.debdiff.
Tue, Jun 30, 8:31 AM
ema added a comment to T256444: several purgeds badly backlogged (> 10 days).

There may be another solution, namely creating a new apt component to hold 1.4.x and deploying it selectively where needed (as opposed to rolling it out everywhere). In theory on cp nodes it should be fine, varnishkafka should be compatible with the 1.x.x API (some time ago I opened T210944 to track this and found only minor issues/nits), and if not we could come up with a quick patch to make it work. Then the rest of the nodes could use either what Debian provides, or the more up-to-date 1.4.x component. Finally, we could use this as a "test" for 1.x.x librdkafka on a specific use case before using it everywhere.

Tue, Jun 30, 8:22 AM · Patch-For-Review, User-notice, Operations, Traffic
ema created P11703 librdkafka_0.11.6-1.1wmf1.debdiff.
Tue, Jun 30, 7:31 AM

Mon, Jun 29

ema awarded T164819: reprepro: Support for buildinfo files / dbgsym packages a Heartbreak token.
Mon, Jun 29, 11:57 AM · Patch-For-Review, Operations
ema added a comment to T256444: several purgeds badly backlogged (> 10 days).

https://github.com/confluentinc/confluent-kafka-go/issues/251

Mon, Jun 29, 10:18 AM · Patch-For-Review, User-notice, Operations, Traffic
ema committed rOSPU290655390754: Release version 0.16 (authored by ema).
Release version 0.16
Mon, Jun 29, 9:19 AM
ema added a comment to T256444: several purgeds badly backlogged (> 10 days).

The issue happened again on cp4025 and a few other nodes. It looks like a deadlock in librdkafka to me; the process is spinning on pthread_cond_wait:

Mon, Jun 29, 8:22 AM · Patch-For-Review, User-notice, Operations, Traffic

Sat, Jun 27

ema added a comment to T256444: several purgeds badly backlogged (> 10 days).

I believe this is resolved now, right? Or is there still user-facing impact in the form of served cache objects, e.g. more than a few minutes past their purge date?

Sat, Jun 27, 9:54 AM · Patch-For-Review, User-notice, Operations, Traffic

Fri, Jun 26

ema moved T256479: purged crashes with "fatal error: concurrent map read and map write" from Triage to Caching on the Traffic board.
Fri, Jun 26, 1:29 PM · Operations, Traffic
ema triaged T256479: purged crashes with "fatal error: concurrent map read and map write" as Medium priority.
Fri, Jun 26, 1:29 PM · Operations, Traffic
ema created T256479: purged crashes with "fatal error: concurrent map read and map write".
Fri, Jun 26, 1:19 PM · Operations, Traffic
ema moved T256467: Make atsmtail-backend.service depend on fifo-log-demux from Triage to Caching on the Traffic board.
Fri, Jun 26, 10:30 AM · Operations, Traffic
ema closed T256449: cp5006 multiple alerts (and SSH flapping) as Resolved.

The host looks fine, closing for now.

Fri, Jun 26, 10:30 AM · Traffic, Operations, ops-eqsin
ema triaged T256467: Make atsmtail-backend.service depend on fifo-log-demux as Low priority.
Fri, Jun 26, 10:29 AM · Operations, Traffic
ema created T256467: Make atsmtail-backend.service depend on fifo-log-demux.
Fri, Jun 26, 10:29 AM · Operations, Traffic
ema added a comment to T256444: several purgeds badly backlogged (> 10 days).

I have identified the misbehaving purged instances with rate(purged_events_received_total{cluster="cache_text", topic="eqiad.resource-purge"}[5m]) == 0 and restarted them.

Fri, Jun 26, 10:21 AM · Patch-For-Review, User-notice, Operations, Traffic
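
The query in the comment above selects instances whose purged_events_received_total counter did not advance over a 5 minute window. As a rough sketch of the same idea outside Prometheus, the Go snippet below takes two readings of a counter per host and reports the hosts where the value did not change. The Sample type, the stalledHosts function, and the host names are all hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample holds two readings of a monotonically increasing counter,
// taken at the start and at the end of an observation window.
// (Hypothetical type, for illustration only.)
type Sample struct {
	Start, End float64
}

// stalledHosts returns, in sorted order, the hosts whose counter did not
// move during the window, which is what a rate of exactly zero means.
func stalledHosts(samples map[string]Sample) []string {
	var stalled []string
	for host, s := range samples {
		if s.End == s.Start {
			stalled = append(stalled, host)
		}
	}
	sort.Strings(stalled)
	return stalled
}

func main() {
	// Made-up host names and counter values.
	samples := map[string]Sample{
		"cp-a": {Start: 1200, End: 1200}, // no events received in the window
		"cp-b": {Start: 900, End: 1450},  // healthy consumer
	}
	fmt.Println(stalledHosts(samples)) // prints [cp-a]
}
```

Unlike Prometheus's rate(), this sketch makes no attempt to handle counter resets; a reading that goes down is simply treated as "changed".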
ema moved T256444: several purgeds badly backlogged (> 10 days) from Triage to Caching on the Traffic board.
Fri, Jun 26, 9:49 AM · Patch-For-Review, User-notice, Operations, Traffic
ema moved T256446: monitoring & alerting for purged from Triage to Caching on the Traffic board.
Fri, Jun 26, 9:49 AM · Sustainability (Incident Prevention), Operations, Traffic
ema triaged T256446: monitoring & alerting for purged as Medium priority.
Fri, Jun 26, 9:47 AM · Sustainability (Incident Prevention), Operations, Traffic
ema triaged T256444: several purgeds badly backlogged (> 10 days) as High priority.
Fri, Jun 26, 9:47 AM · Patch-For-Review, User-notice, Operations, Traffic

Thu, Jun 25

ema created P11661 (An Untitled Masterwork).
Thu, Jun 25, 12:01 PM
ema placed T256138: Create ssh keypair for integration/docroot deployment with scap up for grabs.
Thu, Jun 25, 11:50 AM · Operations
ema added a comment to T256138: Create ssh keypair for integration/docroot deployment with scap.

Key generated and added to the private puppet repo under modules/secret/secrets/keyholder.

Thu, Jun 25, 11:49 AM · Operations
ema moved T256201: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group from Backlog to NDA Pending on the LDAP-Access-Requests board.
Thu, Jun 25, 8:45 AM · WMF-Legal, LDAP-Access-Requests, Operations
ema closed T254135: WMF-NDA access for ngkountas as Declined.

Declining for the same reason as T254134.

Thu, Jun 25, 8:42 AM · WMF-NDA-Requests
ema closed T254134: WMF-NDA access for _abi as Declined.

@abi_ @Majavah - as I understand it, being a member of Trusted-Contributors is enough to vote on the poll and there's no NDA necessary. Declining this task, feel free to reopen if my understanding is incorrect.

Thu, Jun 25, 8:41 AM · WMF-NDA-Requests

Wed, Jun 24

ema closed T255525: Close teampractices mailing list (as it has no active admins) as Resolved.
Wed, Jun 24, 2:58 PM · Operations, Wikimedia-Mailing-lists
ema triaged T255525: Close teampractices mailing list (as it has no active admins) as Medium priority.
Wed, Jun 24, 2:55 PM · Operations, Wikimedia-Mailing-lists
ema added a comment to T256193: Request for new mailing list for ILAE English Wikipedia project.

@Diptanshu.D: list created, you should have received an email.

Wed, Jun 24, 2:38 PM · Wikimedia-Mailing-lists, Operations
ema triaged T255951: Creation of mailinglist for Board of WUG Esperanto and Free Knowledge as Medium priority.
Wed, Jun 24, 1:54 PM · Operations, Wikimedia-Mailing-lists
ema triaged T256193: Request for new mailing list for ILAE English Wikipedia project as Medium priority.
Wed, Jun 24, 1:54 PM · Wikimedia-Mailing-lists, Operations
ema moved T255525: Close teampractices mailing list (as it has no active admins) from Backlog to List maintenance on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:53 PM · Operations, Wikimedia-Mailing-lists
ema moved T255951: Creation of mailinglist for Board of WUG Esperanto and Free Knowledge from Backlog to List creation on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:53 PM · Operations, Wikimedia-Mailing-lists
ema moved T256193: Request for new mailing list for ILAE English Wikipedia project from Backlog to List creation on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:53 PM · Wikimedia-Mailing-lists, Operations
ema moved T249678: Add oauth login to the mailman package for accessing list memberships/archive viewing from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:53 PM · Upstream, Operations, Wikimedia-Mailing-lists
ema moved T248384: Delete email addresses with privileged @domain names from mailing lists at offboarding from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:53 PM · Operations, Wikimedia-Mailing-lists
ema moved T247603: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:53 PM · Mail, Operations, Wikimedia-Mailing-lists
ema moved T244241: Allow list admins to train spam filters from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:53 PM · serviceops, Operations, Wikimedia-Mailing-lists
ema moved T240929: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Operations, Wikimedia-Mailing-lists
ema moved T232417: mass Yahoo / AOL bounces mailman from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Mail, Operations, Wikimedia-Mailing-lists
ema moved T225553: gmail users being suspended from mediawiki-l due to excessive bounces due to DMARC from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Operations, Wikimedia-Mailing-lists
ema moved T225269: Verify that all mailman mailing lists have private_roster=2 from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Operations, Wikimedia-Mailing-lists
ema moved T197819: investigate caching of mailman listinfo pages from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Mail, Operations, Wikimedia-Mailing-lists
ema moved T194669: Provide a mean to mass discard/reject subscription requests on Wikimedia mailing lists from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Wikimedia-Mailing-lists, Operations
ema moved T193573: Consider allowing mailing lists to be indexed by archive.org from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Operations, Wikimedia-Mailing-lists, Internet-Archive
ema moved T190054: Pipermail on lists.wikimedia.org is not mobile friendly from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Operations, Wikimedia-Mailing-lists, Mobile
ema moved T190061: Pipermail uses background color without foreground colors from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Operations, Wikimedia-Mailing-lists, Accessibility
ema moved T186311: wikitech-l is mangling my PGP/MIME emails, causing signature validation to fail from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Operations, Wikimedia-Mailing-lists
ema moved T179568: all mailing lists should have descriptions from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · Operations, Wikimedia-Mailing-lists
ema moved T173894: Mailman cannot correctly decode GB2312-superset mails labelled as GB2312 (non-standard behavior) from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:52 PM · OTRS, Operations, Wikimedia-Mailing-lists, Chinese-Sites
ema moved T172929: https://lists.wikimedia.org/mailman/options/ doesn't set charset header from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:51 PM · Operations, Upstream, Wikimedia-Mailing-lists
ema moved T170443: Allow applying spam filter fules before sender rules in Mailman filtering from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:51 PM · Operations, Upstream, Wikimedia-Mailing-lists
ema moved T150164: Mailman: Consider hiding real list administrators email addresses from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:51 PM · Operations, Upstream, Wikimedia-Mailing-lists
ema moved T115329: "From" at start of line becomes ">From" in pipermail from Backlog to General on the Wikimedia-Mailing-lists board.
Wed, Jun 24, 1:51 PM · Operations, Upstream, Wikimedia-Mailing-lists
ema triaged T256217: ETAG response headers not always with double-quotes as Medium priority.
Wed, Jun 24, 10:20 AM · Traffic, Operations, serviceops, affects-Kiwix-and-openZIM
ema added a comment to T254442: NDA for superset access request from WMDE employee danshick.

@KFrancis: let me know when Dan is added to the NDA and MOU spreadsheet so that I can carry on with this request. Thanks!

Wed, Jun 24, 9:22 AM · LDAP-Access-Requests, Operations
ema triaged T256201: Add Guergana Tzatchkova to the ldap/wmde and ldap/nda group as Medium priority.
Wed, Jun 24, 9:13 AM · WMF-Legal, LDAP-Access-Requests, Operations
ema claimed T256138: Create ssh keypair for integration/docroot deployment with scap.
Wed, Jun 24, 9:01 AM · Operations
ema updated the task description for T255836: Requesting access to centralauth database for Jennifer Wang.
Wed, Jun 24, 8:52 AM · Operations, SRE-Access-Requests

Tue, Jun 23

ema moved T255914: Allow AKlapper to disable other people's personal Herald rules in Phabricator from Untriaged to Ready To Go on the SRE-Access-Requests board.
Tue, Jun 23, 9:21 AM · Operations, SRE-Access-Requests
ema moved T255836: Requesting access to centralauth database for Jennifer Wang from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Tue, Jun 23, 9:21 AM · Operations, SRE-Access-Requests
ema added a comment to T255836: Requesting access to centralauth database for Jennifer Wang.

Hi @jwang, to carry on with your access request we need some additional information:

Tue, Jun 23, 9:21 AM · Operations, SRE-Access-Requests
ema updated the task description for T255836: Requesting access to centralauth database for Jennifer Wang.
Tue, Jun 23, 9:14 AM · Operations, SRE-Access-Requests
ema triaged T255836: Requesting access to centralauth database for Jennifer Wang as Medium priority.
Tue, Jun 23, 9:10 AM · Operations, SRE-Access-Requests
ema added a comment to T255914: Allow AKlapper to disable other people's personal Herald rules in Phabricator.

@Aklapper: patch merged, please try and see if the command works as expected.

Tue, Jun 23, 9:08 AM · Operations, SRE-Access-Requests
ema triaged T255914: Allow AKlapper to disable other people's personal Herald rules in Phabricator as Medium priority.
Tue, Jun 23, 8:55 AM · Operations, SRE-Access-Requests
ema moved T254442: NDA for superset access request from WMDE employee danshick from Awaiting User Input to NDA Pending on the LDAP-Access-Requests board.
Tue, Jun 23, 8:50 AM · LDAP-Access-Requests, Operations
ema closed T255775: Add Abban to the ldap/nda group as Resolved.

Done.

Tue, Jun 23, 8:47 AM · Operations, LDAP-Access-Requests

Fri, Jun 19

ema edited P11602 (An Untitled Masterwork).
Fri, Jun 19, 8:00 AM
ema created P11602 (An Untitled Masterwork).
Fri, Jun 19, 7:59 AM

Thu, Jun 18

ema added a comment to T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable.

Hm, I'm pretty sure the connection is terminated even when there are events being sent.

Thu, Jun 18, 3:43 PM · Patch-For-Review, Traffic, Operations, Analytics-Kanban, Analytics, EventStreams
ema closed T255368: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams as Resolved.

What we need to do is (1) ensure the origins don't send Transfer-Encoding on 304 responses and (2) make sure ATS does not add Transfer-Encoding to cached objects when receiving a 304.

Thu, Jun 18, 11:51 AM · Traffic, Operations
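
Step (1) in the comment above could be checked mechanically. The following is a minimal Go sketch, assuming the status code and raw response headers have already been captured by some other means; the function name transferEncodingOn304 is hypothetical. It encodes only the policy stated in the comment (no Transfer-Encoding on 304 responses) and makes no claim about what the HTTP specification itself permits.

```go
package main

import (
	"fmt"
	"net/http"
)

// transferEncodingOn304 reports whether a 304 (Not Modified) response
// carries a Transfer-Encoding header, i.e. the condition the comment
// above wants origins to avoid. (Hypothetical helper, for illustration.)
func transferEncodingOn304(status int, h http.Header) bool {
	return status == http.StatusNotModified && h.Get("Transfer-Encoding") != ""
}

func main() {
	h := http.Header{}
	h.Set("Transfer-Encoding", "chunked")

	fmt.Println(transferEncodingOn304(http.StatusNotModified, h)) // prints true
	fmt.Println(transferEncodingOn304(http.StatusOK, h))          // prints false
}
```

The helper works on raw headers because Go's own HTTP client moves Transfer-Encoding out of the header map and into the Response.TransferEncoding field, so headers obtained that way would need to be read from that field instead.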

Wed, Jun 17

ema added a comment to T255368: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams.

Alright, I finally understood what's going on. The problem here is that (1) the origin server is adding Transfer-Encoding: chunked to responses to conditional HEAD requests, which makes no sense twice over: HEAD responses don't have a body, and 304s don't either:

Wed, Jun 17, 11:31 AM · Traffic, Operations

Tue, Jun 16

ema added a comment to T255368: noc.wikimedia.org consistently 503s in eqsin and sometimes 503s in esams.

We should try to reproduce TE:chunked being added to a stale object on 304 responses from the origin.

Tue, Jun 16, 2:18 PM · Traffic, Operations