Thu, Apr 15

CDanis committed rOHPUd9843950d162: Revert "prepend esams/knams" (authored by CDanis).
Revert "prepend esams/knams"
Thu, Apr 15, 3:40 PM
CDanis added a reverting change for rOHPU43f3d38d6d2f: prepend esams/knams: rOHPUd9843950d162: Revert "prepend esams/knams".
Thu, Apr 15, 3:40 PM

Wed, Apr 14

CDanis committed rOHPU43f3d38d6d2f: prepend esams/knams (authored by CDanis).
prepend esams/knams
Wed, Apr 14, 9:51 PM
CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Wed, Apr 14, 9:42 PM · Patch-For-Review, Product-Data-Infrastructure, SRE, Goal, Epic

Wed, Apr 7

JAllemandou awarded T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data a Baby Tequila token.
Wed, Apr 7, 7:58 AM · Patch-For-Review, SRE, Analytics, Traffic

Mon, Apr 5

CDanis created T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data.
Mon, Apr 5, 8:02 PM · Patch-For-Review, SRE, Analytics, Traffic

Wed, Mar 31

CDanis added a comment to T279013: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution".

I'm not sure if we tend to use IP addresses directly in Mediawiki out of latency concerns, reliability concerns, PHP/other client bugginess concerns, or DNS recursor capacity concerns, or some mix of all of the above. (I've seen all of the above occur at different times.)

Wed, Mar 31, 9:25 PM · serviceops, User-brennen, DBA, Phabricator

Mar 15 2021

CDanis updated the task description for T277485: Include request_id in the comments of database queries.
Mar 15 2021, 5:33 PM · Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, serviceops-radar
CDanis added a subtask for T277417: 14 March 2021 Wikimedia API Outage: T277485: Include request_id in the comments of database queries.
Mar 15 2021, 4:53 PM · Wikimedia-Incident, SRE
CDanis added a parent task for T277485: Include request_id in the comments of database queries: T277417: 14 March 2021 Wikimedia API Outage.
Mar 15 2021, 4:53 PM · Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, serviceops-radar
CDanis created T277485: Include request_id in the comments of database queries.
Mar 15 2021, 4:53 PM · Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, serviceops-radar

Mar 10 2021

CDanis closed T263496: Augment NEL reports with GeoIP country code and network AS number as Resolved.

ASN, ISP/organization, country, & subdivision are now visible in Logstash!

Mar 10 2021, 9:36 PM · Patch-For-Review, Analytics, SRE
CDanis closed T263496: Augment NEL reports with GeoIP country code and network AS number, a subtask of T257527: automatically collect network error reports from users' browsers (Network Error Logging API), as Resolved.
Mar 10 2021, 9:35 PM · Patch-For-Review, Product-Data-Infrastructure, SRE, Goal, Epic
CDanis created T277096: WMF helmfile installation does not work for ZSH users.
Mar 10 2021, 8:42 PM · Kubernetes, serviceops
CDanis updated the task description for T276486: gerrit's sshd is incompatible with RSA pubkeys + Fedora 33 clients (and future versions of OpenSSH proper).
Mar 10 2021, 3:12 PM · Gerrit
CDanis updated the task description for T276486: gerrit's sshd is incompatible with RSA pubkeys + Fedora 33 clients (and future versions of OpenSSH proper).
Mar 10 2021, 3:10 PM · Gerrit

Mar 8 2021

CDanis added a comment to T276486: gerrit's sshd is incompatible with RSA pubkeys + Fedora 33 clients (and future versions of OpenSSH proper).

What I understand is that the Mina SSHD version embedded in our Gerrit does not support rsa-sha-xxx, so the version it ships would eventually need to be upgraded.

Mar 8 2021, 2:21 PM · Gerrit

Mar 4 2021

CDanis updated the task description for T276486: gerrit's sshd is incompatible with RSA pubkeys + Fedora 33 clients (and future versions of OpenSSH proper).
Mar 4 2021, 5:46 PM · Gerrit
CDanis created T276486: gerrit's sshd is incompatible with RSA pubkeys + Fedora 33 clients (and future versions of OpenSSH proper).
Mar 4 2021, 5:42 PM · Gerrit

Mar 3 2021

CDanis renamed T276299: alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc) from dbctl not sending !log to IRC to alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc).
Mar 3 2021, 4:17 PM · observability, SRE
CDanis updated subscribers of T276299: alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc).

I believe this is related to https://sal.toolforge.org/log/YZV29XcBa_6PSCT9KHrZ

Mar 3 2021, 7:04 AM · observability, SRE
CDanis added a comment to T267714: ripe-atlas-codfw is down.

I'm pretty sure the baud rate is 19200

Mar 3 2021, 1:54 AM · ops-codfw, SRE, netops

Mar 2 2021

CDanis updated the language for P14571 Command-Line Input from autodetect to js.
Mar 2 2021, 9:49 PM
CDanis created P14571 Command-Line Input.
Mar 2 2021, 9:48 PM
CDanis created P14570 Command-Line Input.
Mar 2 2021, 7:47 PM
CDanis added a comment to T276213: Sudden surge of requests to https://wikipedia.org/ from Telus customers.

Did this cause any actual issue?

Mar 2 2021, 2:22 PM · Traffic, SRE
CDanis added a project to P8871 git fetch over anon HTTPS, git push over SSH: Gerrit.
Mar 2 2021, 2:17 PM · Gerrit

Feb 25 2021

CDanis added a comment to T274888: cp_upload @ eqsin cascading failures, February 2021.
  1. sh hashing - I think @CDanis already worked on some patches to transition us to maglev hashing a quarter or two ago, but it wasn't ever deployed to all production LVSes. Need to take a look at the state of affairs on that and see if we're at a point where we can safely deploy it (also, re-check assumptions about whether it relies on a weight=0 depooling strategy?).
Feb 25 2021, 7:04 PM · Patch-For-Review, SRE, Traffic
CDanis added a comment to T275806: wmf-utils has an outdated script to update known hosts files.

wmf-utils is a little-used repo, with only two scripts in it.

Feb 25 2021, 6:37 PM · SRE

Feb 19 2021

CDanis updated subscribers of T275234: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org.
Feb 19 2021, 6:59 PM · SRE, netops, Traffic
CDanis added a parent task for T275211: TATA SKY users unable to connect with upload.wikimedia.org in browsers except Opera: T275234: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org.
Feb 19 2021, 6:58 PM · SRE, Traffic
CDanis added a subtask for T275234: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org: T275211: TATA SKY users unable to connect with upload.wikimedia.org in browsers except Opera.
Feb 19 2021, 6:58 PM · SRE, netops, Traffic
CDanis created T275234: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org.
Feb 19 2021, 6:58 PM · SRE, netops, Traffic

Feb 18 2021

CDanis closed T275028: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin, a subtask of T274888: cp_upload @ eqsin cascading failures, February 2021, as Resolved.
Feb 18 2021, 8:05 PM · Patch-For-Review, SRE, Traffic
CDanis closed T275028: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin as Resolved.

Caches have filled, and there are no more fetch failures due to "LRU limited": https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=1613576456135&to=1613678433179

Feb 18 2021, 8:05 PM · SRE, Traffic

Feb 17 2021

CDanis created T275046: provision more machines for eqsin caches.
Feb 17 2021, 4:48 PM · SRE, Traffic
CDanis added a comment to T263496: Augment NEL reports with GeoIP country code and network AS number.

@Ottomata just one more question for you!

Feb 17 2021, 4:18 PM · Patch-For-Review, Analytics, SRE
CDanis renamed T275028: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin from validate or revert the new large_objects_cutoff & nule_limit settings on upload@eqsin to validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin.
Feb 17 2021, 1:56 PM · SRE, Traffic
CDanis created T275028: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin.
Feb 17 2021, 1:41 PM · SRE, Traffic

Feb 16 2021

CDanis reassigned T273780: Request to add Georgina Burnett to the ldap/wmde group from KFrancis to MoritzMuehlenhoff.
Feb 16 2021, 7:16 PM · LDAP-Access-Requests, SRE
CDanis created T274888: cp_upload @ eqsin cascading failures, February 2021.
Feb 16 2021, 2:45 PM · Patch-For-Review, SRE, Traffic

Feb 14 2021

CDanis created T274734: check_icinga alerts lack metadata in VictorOps.
Feb 14 2021, 1:47 PM · observability

Feb 11 2021

Urbanecm awarded T274595: Either include X-Varnish in MediaWiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp X-Request-Id in Varnish 5xx logs a Love token.
Feb 11 2021, 10:12 PM · SRE
CDanis created T274595: Either include X-Varnish in MediaWiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp X-Request-Id in Varnish 5xx logs.
Feb 11 2021, 10:11 PM · SRE
CDanis added a comment to T273780: Request to add Georgina Burnett to the ldap/wmde group.

Email address sent privately.

Feb 11 2021, 6:33 PM · LDAP-Access-Requests, SRE
CDanis updated the task description for T273780: Request to add Georgina Burnett to the ldap/wmde group.
Feb 11 2021, 6:33 PM · LDAP-Access-Requests, SRE

Feb 9 2021

CDanis added a comment to T269324: Productionize x2 databases.

nightmare […] It all starts with having to depool them via a MW commit, […]

I'm confident we can avoid this for mainstash db (x2).

The fact that this requires a MW commit today for parser cache is afaik "just" because we haven't moved more of db-related config to Etcd. As with previous moves, we need to be very careful and aware of the contractual expectations and needs, which for parser cache are indeed non-trivial, but ultimately it is just a primitive array structure that has no inherent need for being in PHP or MW config. As far as I'm concerned parser cache config can and should, too, be moved to Etcd if there are no other stakeholders raising concerns against that.

I would love to see that, but not sure how doable it is. That's probably for @CDanis to estimate.

Feb 9 2021, 1:52 PM · Performance-Team (Radar), Patch-For-Review, DBA

Feb 8 2021

CDanis added a comment to T273983: eqiad: Move maps1001 same rack A4.

@hnowlan just a heads up that it looks like the depool of maps1001 left maps@eqiad underprovisioned:
https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=1612777514651&to=1612819057362

Feb 8 2021, 9:19 PM · SRE, ops-eqiad
CDanis closed T273602: Access to analytics-privatedata-users for Research contractor AikoChou as Resolved.

@CDanis nda LDAP is also needed for Jupyter access. Pretty much all users should get LDAP access if they are getting any access at all.

Feb 8 2021, 2:30 PM · Research, SRE, SRE-Access-Requests

Feb 5 2021

CDanis updated the task description for T272982: Add kzeta to analytics-privatedata-users.
Feb 5 2021, 11:12 PM · SRE, SRE-Access-Requests, Analytics
CDanis added a comment to T272982: Add kzeta to analytics-privatedata-users.

I contacted Carol on Slack; this request is approved.

Feb 5 2021, 11:12 PM · SRE, SRE-Access-Requests, Analytics
CDanis closed T271602: Hue access for Peter Pelberg as Resolved.

Glad to hear, thanks!

Feb 5 2021, 6:57 PM · SRE, SRE-Access-Requests
CDanis added a comment to T271602: Hue access for Peter Pelberg.

Hi, just wanted to check in if anything more was needed here?

Feb 5 2021, 6:54 PM · SRE, SRE-Access-Requests
CDanis assigned T273780: Request to add Georgina Burnett to the ldap/wmde group to KFrancis.

@KFrancis Can you please get an NDA signed with this WMDE staff member? Thanks!

Feb 5 2021, 6:45 PM · LDAP-Access-Requests, SRE
CDanis assigned T273813: Access to Product Superset for Rmurthy to jrobell.

@jrobell Can you please confirm? Thanks!

Feb 5 2021, 6:43 PM · SRE, LDAP-Access-Requests
CDanis closed T273980: Grant Access to `wmf` LDAP group for AGueyte as Resolved.

The wmf group does not require manager approval -- only verification that staff is staff :)

Feb 5 2021, 6:42 PM · SRE, LDAP-Access-Requests
CDanis closed T273602: Access to analytics-privatedata-users for Research contractor AikoChou as Resolved.

@AikoChou should have shell access within half an hour.

Feb 5 2021, 6:13 PM · Research, SRE, SRE-Access-Requests
CDanis added a comment to T273602: Access to analytics-privatedata-users for Research contractor AikoChou.

Checked in with Miriam on IRC and Turnilo/Superset access isn't needed, but Kerberos is. Doing that now.

Feb 5 2021, 5:56 PM · Research, SRE, SRE-Access-Requests
CDanis added a comment to T273602: Access to analytics-privatedata-users for Research contractor AikoChou.

Thank you!

Feb 5 2021, 5:47 PM · Research, SRE, SRE-Access-Requests
CDanis updated subscribers of T273951: Update Icinga meta-monitoring to account for "no pagers" in contacts.
Feb 5 2021, 2:28 PM · User-fgiunchedi, observability

Feb 4 2021

CDanis updated subscribers of T273064: Setup Analytics team in VO/splunk oncall.

I think the change pushed today in the puppet private repo (hash b1b32d4ab) broke the meta-monitoring validation script.

Feb 4 2021, 8:40 PM · Patch-For-Review, Analytics-Kanban, Analytics-Clusters, User-fgiunchedi, observability
CDanis edited projects for T268344: Implement SSH CA (certificate authority) for host keys?, added: SRE; removed serviceops.
Feb 4 2021, 2:18 PM · User-MoritzMuehlenhoff, SRE, Security, cloud-services-team (Kanban)

Feb 2 2021

CDanis assigned T273602: Access to analytics-privatedata-users for Research contractor AikoChou to AikoChou.

Waiting on @AikoChou to complete prerequisites and also for @Ottomata to approve from Analytics.

Feb 2 2021, 10:09 PM · Research, SRE, SRE-Access-Requests
CDanis triaged T273685: Turnilo "Display Druid query" gives "general error" as Low priority.
Feb 2 2021, 10:07 PM · Analytics
CDanis created T273685: Turnilo "Display Druid query" gives "general error".
Feb 2 2021, 10:07 PM · Analytics

Jan 29 2021

CDanis closed T273301: cr1-eqiad<>asw2-d-eqiad link down as Resolved.

looks good now, thanks!

Jan 29 2021, 10:13 PM · SRE, netops, ops-eqiad
CDanis added a comment to T273332: Unable to get production access : Maya Kampurath (due to bastion host changed).

Hi Maya,

Jan 29 2021, 9:54 PM · SRE, SRE-Access-Requests
CDanis reopened T273301: cr1-eqiad<>asw2-d-eqiad link down as "Open".

I think something is still wrong? LibreNMS is showing the port on the asw receiving about 7kbps of errors: https://librenms.wikimedia.org/device/device=149/tab=port/port=12410/view=graphs/

Jan 29 2021, 9:50 PM · SRE, netops, ops-eqiad
CDanis added a comment to T273328: cr4-ulsfo<>cr2-eqsin GRE tunnel flapping due to BFD timer expired.

The first few cycles of logs from the ulsfo side:

Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:4d 18:26 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 159, new state: down, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:21:46  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 29 20:21:46  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 29 20:21:47  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 159, new state: up, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:09  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Jan 29 20:22:39  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:00:00:52 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Jan 29 20:22:39  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 159, new state: down, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:39  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 29 20:22:39  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 29 20:22:41  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Jan 29 20:22:43  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 159, new state: up, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:50  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:00:00:06 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
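
The flapping in the log above can be summarized by pulling each "Up -> Down" transition and its reported "Up time" out of the bfdd lines. This is only a rough sketch: the regular expression is written against the line format shown in this excerpt, and the names `BFD_DOWN` and `bfd_down_events` are hypothetical helpers, not part of any Juniper or Wikimedia tooling.

```python
import re

# Matches bfdd "state Up -> Down" lines as they appear in the excerpt above.
# The pattern is an assumption based on these samples only.
BFD_DOWN = re.compile(
    r"BFD Session (?P<peer>\S+) \(IFL \d+\) state Up -> Down "
    r".*?Up time:(?P<uptime>.+?) Local diag"
)


def bfd_down_events(lines):
    """Return (peer, uptime) tuples for every Up -> Down transition found."""
    events = []
    for line in lines:
        match = BFD_DOWN.search(line)
        if match:
            events.append((match.group("peer"), match.group("uptime").strip()))
    return events


# Two lines copied from the log above, used as sample input.
SAMPLE = [
    "Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:4d 18:26 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.",
    "Jan 29 20:21:47  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 159, new state: up, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66",
]

for peer, uptime in bfd_down_events(SAMPLE):
    print(f"{peer} went down after being up for {uptime}")
```

Shrinking "Up time" values between consecutive events (4d 18:26, then 00:00:52, then 00:00:06 in the excerpt) are what make the flap pattern visible.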
Jan 29 2021, 9:23 PM · SRE, Traffic, netops
CDanis triaged T273328: cr4-ulsfo<>cr2-eqsin GRE tunnel flapping due to BFD timer expired as High priority.
Jan 29 2021, 9:21 PM · SRE, Traffic, netops
CDanis created T273328: cr4-ulsfo<>cr2-eqsin GRE tunnel flapping due to BFD timer expired.
Jan 29 2021, 9:21 PM · SRE, Traffic, netops
CDanis added a comment to T273312: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade.

Here are the overall, whole-cluster mean latencies in milliseconds, summed across all machines and then broken down by kernel version.

Jan 29 2021, 9:01 PM · Performance-Team (Radar), User-jijiki, SRE, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, serviceops
CDanis triaged T273301: cr1-eqiad<>asw2-d-eqiad link down as High priority.
Jan 29 2021, 4:55 PM · SRE, netops, ops-eqiad
CDanis created T273301: cr1-eqiad<>asw2-d-eqiad link down.
Jan 29 2021, 4:55 PM · SRE, netops, ops-eqiad

Jan 28 2021

CDanis committed rOHPUae3191cb6531: decom Zayo transit in codfw (authored by CDanis).
decom Zayo transit in codfw
Jan 28 2021, 7:10 PM

Jan 27 2021

CDanis added a comment to T273003: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API.

Can you please provide a complete dump of a "null response", with both the complete response headers and the raw response body?

Jan 27 2021, 3:12 PM · Traffic, SRE

Jan 26 2021

CDanis added projects to T273003: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API: Traffic, serviceops.

It seems the User-Agent being used is Peachy MediaWiki Bot API Version 2.0 (alpha 8) (which ideally should be updated to comply with the User-Agent policy).

Jan 26 2021, 7:05 PM · Traffic, SRE
CDanis added a comment to T273003: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API.

What is the originating IP address of these requests?

Jan 26 2021, 6:44 PM · Traffic, SRE
CDanis created T272988: 'Move circuit' script.
Jan 26 2021, 4:00 PM · netbox

Jan 22 2021

CDanis added a comment to T272633: ChartMuseum responses are cached in the CDN with default (24h) ttl.

Unfortunately, upstream was not very responsive to my question about adding Cache-Control (https://github.com/helm/chartmuseum/issues/368). I wonder why this problem arose only recently, as we have run this configuration from the start.

As ChartMuseum does an internal caching of the repo objects as well, I think we can switch to 'pass' for now. Will prepare a patch.

Jan 22 2021, 2:36 PM · SRE, Traffic, serviceops

Jan 21 2021

CDanis added a comment to T272633: ChartMuseum responses are cached in the CDN with default (24h) ttl.

An easy way to do this would be to just switch 'normal' to 'pass' here. Then there would be no caching at all. We do the same for other services like cxserver, API, blubberoid, config-master, debmonitor and others.

Would that be ok? If not then it probably needs a new type of caching besides the existing: normal, pass, pipe and websockets?

Jan 21 2021, 9:09 PM · SRE, Traffic, serviceops
CDanis created T272633: ChartMuseum responses are cached in the CDN with default (24h) ttl.
Jan 21 2021, 8:20 PM · SRE, Traffic, serviceops
CDanis created P13881 Command-Line Input.
Jan 21 2021, 8:04 PM
CDanis added a comment to T269324: Productionize x2 databases.

Sorry for the truly baffling error message.

Jan 21 2021, 5:52 PM · Performance-Team (Radar), Patch-For-Review, DBA

Jan 20 2021

CDanis committed rLPRI874c801ab6d5: add bot_posts_blocked_nets (authored by CDanis).
add bot_posts_blocked_nets
Jan 20 2021, 8:59 PM
CDanis added a comment to T272539: run-puppet-agent --enable flag is broken.

There's one related problem, which is that enable-puppet should check the given message both with and without appending - $SUDO_USER, as perhaps you set a disable-puppet from a context where that wasn't present, for example https://sal.toolforge.org/log/8BlxIXcBgTbpqNOm5XlV
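
The suggested check could look roughly like the following. This is a minimal sketch under stated assumptions: the real enable-puppet and disable-puppet scripts are not shown here, and `disable_message_matches` is a hypothetical helper that only illustrates comparing the stored disable reason against the given message both with and without the " - $SUDO_USER" suffix.

```python
from typing import Optional


def disable_message_matches(stored: str, given: str, sudo_user: Optional[str]) -> bool:
    """Return True if `given` identifies the stored disable reason.

    Both the bare message and the message with " - <user>" appended are
    accepted, so a disable set from a context without SUDO_USER still matches.
    """
    candidates = {given}
    if sudo_user:
        candidates.add(f"{given} - {sudo_user}")
    return stored in candidates


# "alice" and "reimaging" are made-up example values.
print(disable_message_matches("reimaging - alice", "reimaging", "alice"))
print(disable_message_matches("reimaging", "reimaging", "alice"))
print(disable_message_matches("something else", "reimaging", "alice"))
```

Accepting either form keeps the safety property (puppet is only re-enabled by someone who knows the reason) while tolerating disables that were set without a sudo user in the environment.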

Jan 20 2021, 8:44 PM · Puppet, SRE
CDanis added a comment to T272539: run-puppet-agent --enable flag is broken.

There's one related problem, which is that enable-puppet should check the given message both with and without appending - $SUDO_USER, as perhaps you set a disable-puppet from a context where that wasn't present, for example https://sal.toolforge.org/log/8BlxIXcBgTbpqNOm5XlV

Jan 20 2021, 8:43 PM · Puppet, SRE
CDanis edited projects for T272539: run-puppet-agent --enable flag is broken, added: Puppet; removed puppet-compiler.
Jan 20 2021, 8:24 PM · Puppet, SRE
CDanis created T272539: run-puppet-agent --enable flag is broken.
Jan 20 2021, 8:24 PM · Puppet, SRE

Jan 15 2021

CDanis triaged T271587: Create auto-populated LDAP group of those who have production shell access as Medium priority.

John has it right -- I wanted to lower the bar, and ensure all deployers / folks with any sort of shell access have Klaxon access. I wasn't sure that set was covered by wmf/wmde/nda; thanks to John for proving that is indeed the case :)

Jan 15 2021, 4:58 PM · LDAP, SRE

Jan 13 2021

CDanis added a comment to T263496: Augment NEL reports with GeoIP country code and network AS number.

The long term solution here is still not clear and is very tied up with some other yet undefined long term projects, like Data Governance.

Let's just make this happen in the short term just like we have for client ip and user agent.

Jan 13 2021, 3:14 PM · Patch-For-Review, Analytics, SRE
CDanis added a comment to T263496: Augment NEL reports with GeoIP country code and network AS number.

Hey @Ottomata, I meant to get around to this last quarter but didn't. Would very much like to get some mechanism in place soon -- do you have any ideas on the best path forward? Maybe we should set up a quick meeting to chat through options?

Jan 13 2021, 3:04 PM · Patch-For-Review, Analytics, SRE
CDanis added a comment to T270618: Create Generalised blocking stratagy.

Thanks for the writeup with all the background! And for the cleanup patches so far :)

Jan 13 2021, 2:59 PM · Patch-For-Review, SRE, Traffic, netops

Jan 8 2021

CDanis created T271587: Create auto-populated LDAP group of those who have production shell access.
Jan 8 2021, 8:42 PM · LDAP, SRE

Jan 5 2021

CDanis added a comment to T270391: varnish filtering: should we automatically update public_cloud_nets .

A downside, for example with Google, is that it will most likely include crawlers' IPs.

Jan 5 2021, 6:51 PM · User-jbond, netops, Traffic, SRE

Jan 4 2021

fgiunchedi awarded T270324: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE a Party Time token.
Jan 4 2021, 9:22 AM · SRE-OnFire, SRE

Dec 23 2020

CDanis updated the task description for T270324: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE.
Dec 23 2020, 8:34 PM · SRE-OnFire, SRE
CDanis closed T270324: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE as Resolved.

Thanks to observability for initial feedback, and to @Joe @jbond and especially @RLazarus for code reviews!

Dec 23 2020, 8:32 PM · SRE-OnFire, SRE
CDanis added a project to T270324: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE: SRE-OnFire.
Dec 23 2020, 8:32 PM · SRE-OnFire, SRE
CDanis created T270790: enable Python CI in operations/software/klaxon.
Dec 23 2020, 8:30 PM · SRE