CDanis (Chris Danis)
SRE @ WMF

User Details

User Since
Nov 5 2018, 2:54 PM (62 w, 4 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF)

Recent Activity

Yesterday

CDanis triaged T242992: decom grafana1001 as Medium priority.
Thu, Jan 16, 4:52 PM · observability
CDanis created T242992: decom grafana1001.
Thu, Jan 16, 4:30 PM · observability

Tue, Jan 14

CDanis closed T241277: puppet-merge can't accept an explicit SHA1 for an --ops merge as Resolved.
Tue, Jan 14, 4:11 PM · Patch-For-Review, Puppet, Operations

Thu, Jan 9

CDanis added a comment to T241277: puppet-merge can't accept an explicit SHA1 for an --ops merge.

A simple option: if puppet-merge.sh is given a treeish, it *only* does the ops repo or the labsprivate repo (depending on what flag was passed).

Thu, Jan 9, 3:38 PM · Patch-For-Review, Puppet, Operations
CDanis changed the status of T241374: fastnetmon misreports attack type and protocol from Open to Stalled.

Believe this has been worked around for now.

Thu, Jan 9, 3:11 PM · Patch-For-Review, netops, Operations

Wed, Jan 8

CDanis reopened T190090: Offload pings to dedicated server as "Open".

boldly re-opening this, now that the POPs have Ganeti clusters available.

Wed, Jan 8, 6:46 PM · Patch-For-Review, netops, Operations, Traffic

Tue, Jan 7

CDanis added a comment to T240425: cp3055 crashed.

Nothing in racadm getsel or racadm lclog view (latter just has me logging in over SSH).

Tue, Jan 7, 11:09 PM · Traffic, Operations

Mon, Jan 6

CDanis closed T241281: fastnetmon fired for routine text-lb.esams traffic as Resolved.
Mon, Jan 6, 11:33 PM · Operations, netops
CDanis claimed T167689: Add RIPE atlas data to Prometheus.

The steps outlined in Filippo's comment happened, with the difference that I chose to use the netmon* machines for this role.

Mon, Jan 6, 4:07 PM · observability, Operations

Tue, Dec 31

CDanis created T241653: two failing upload VTC tests.
Tue, Dec 31, 7:24 PM · Traffic, Operations

Mon, Dec 23

CDanis created T241374: fastnetmon misreports attack type and protocol.
Mon, Dec 23, 5:47 PM · Patch-For-Review, netops, Operations

Sat, Dec 21

CDanis claimed T241281: fastnetmon fired for routine text-lb.esams traffic.

I'll keep an eye on this and close if there's no other noise.

Sat, Dec 21, 2:40 PM · Operations, netops
CDanis added a comment to T241281: fastnetmon fired for routine text-lb.esams traffic.

Thanks! I had been confused by Attack protocol: tcp in the reports.

Sat, Dec 21, 2:18 PM · Operations, netops

Fri, Dec 20

CDanis created T241281: fastnetmon fired for routine text-lb.esams traffic.
Fri, Dec 20, 9:13 PM · Operations, netops
CDanis created P10006 fastnetmon text-lb.esams normal peak load.
Fri, Dec 20, 9:11 PM
CDanis updated subscribers of T241277: puppet-merge can't accept an explicit SHA1 for an --ops merge.
Fri, Dec 20, 8:17 PM · Patch-For-Review, Puppet, Operations
CDanis created T241277: puppet-merge can't accept an explicit SHA1 for an --ops merge.
Fri, Dec 20, 8:13 PM · Patch-For-Review, Puppet, Operations
CDanis committed rOSHO8da4774c4ef7: devices: sort by fqdn (authored by CDanis).
Fri, Dec 20, 4:33 PM

Thu, Dec 19

CDanis claimed T224888: Network port utilization alerts should be paging.
Thu, Dec 19, 3:13 PM · observability, Traffic, Operations, netops
CDanis claimed T237587: Determine & implement near-term method for escalating network alerts.
Thu, Dec 19, 3:13 PM · Operations, netops, observability

Dec 18 2019

CDanis closed T240991: Does an eqiad Mediawiki need to have codfw DB servers in its hostsByName? and vice versa as Resolved.
Dec 18 2019, 6:36 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms
CDanis closed T240991: Does an eqiad Mediawiki need to have codfw DB servers in its hostsByName? and vice versa, a subtask of T239900: Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs, as Resolved.
Dec 18 2019, 6:36 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms
CDanis closed T229676: #dbctl: generate hostsByName section as well as Resolved.
Dec 18 2019, 6:36 PM · conftool
CDanis added a comment to T240991: Does an eqiad Mediawiki need to have codfw DB servers in its hostsByName? and vice versa.

BTW I'm curious about both 1) the current need (or lack thereof) for cross-DC MediaWiki DB traffic, and 2) the future need. It would be fairly trivial to make dbctl output this if/when we do need it, but enabling hostsByName from etcd now is even easier.

Dec 18 2019, 4:40 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms
CDanis edited P9943 Masterwork From Distant Lands.
Dec 18 2019, 3:52 PM
CDanis created P9939 db1133 processlist.
Dec 18 2019, 1:47 PM
CDanis added a comment to T240789: Return traffic to eqiad WMCS triggering FNM.

+1 to doing #1 and revisiting if it becomes a problem again.

Dec 18 2019, 1:20 PM · Patch-For-Review, cloud-services-team (Kanban), Operations, netops

Dec 17 2019

CDanis closed T229686: #dbctl: manage 'externalLoads' data as Resolved.

Seems to be working well.

Dec 17 2019, 11:35 PM · Performance-Team, DBA, conftool
CDanis updated the task description for T241001: cp3050 depooled due to explosion in CPU usage and inuse sockets.
Dec 17 2019, 10:15 PM · Wikimedia-Incident, Traffic, Operations
CDanis created T241001: cp3050 depooled due to explosion in CPU usage and inuse sockets.
Dec 17 2019, 10:11 PM · Wikimedia-Incident, Traffic, Operations
CDanis updated the task description for T240991: Does an eqiad Mediawiki need to have codfw DB servers in its hostsByName? and vice versa.
Dec 17 2019, 8:32 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms
CDanis updated the task description for T240991: Does an eqiad Mediawiki need to have codfw DB servers in its hostsByName? and vice versa.
Dec 17 2019, 8:22 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms
CDanis created T240991: Does an eqiad Mediawiki need to have codfw DB servers in its hostsByName? and vice versa.
Dec 17 2019, 8:21 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms
CDanis updated the task description for T237424: stunnel-wrap all rsync::server usage.
Dec 17 2019, 7:21 PM · Operations
CDanis added a parent task for T237424: stunnel-wrap all rsync::server usage: T240941: Clean up SSL configuration.
Dec 17 2019, 5:48 PM · Operations
CDanis added a subtask for T240941: Clean up SSL configuration: T237424: stunnel-wrap all rsync::server usage.
Dec 17 2019, 5:48 PM · Patch-For-Review, Puppet, Operations, User-jbond

Dec 16 2019

CDanis renamed T214734: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) from PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) to All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp).
Dec 16 2019, 10:00 PM · Patch-For-Review, Release-Engineering-Team, Performance-Team (Radar), serviceops, Wikimedia-production-error, User-fgiunchedi, Operations

Dec 13 2019

CDanis added a comment to T229686: #dbctl: manage 'externalLoads' data.

I tested this on mwdebug1001 by manually installing my patch there.

Dec 13 2019, 10:30 PM · Performance-Team, DBA, conftool
CDanis created P9869 (An Untitled Masterwork).
Dec 13 2019, 6:11 PM
CDanis created P9868 (An Untitled Masterwork).
Dec 13 2019, 5:55 PM
CDanis added a comment to T239900: Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs.

At the MediaWiki level, that would look like assigning a non-zero value in $wgLBFactoryConf['sectionLoads'] and/or $wgLBFactoryConf['groupLoadsBySection'] and/or $wgLBFactoryConf['groupLoadsByDB']. I don't know how that translates into dbctl.

Dec 13 2019, 3:43 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms

Dec 12 2019

CDanis created P9865 (An Untitled Masterwork).
Dec 12 2019, 4:32 PM
CDanis claimed T229686: #dbctl: manage 'externalLoads' data.
Dec 12 2019, 3:50 PM · Performance-Team, DBA, conftool

Dec 11 2019

CDanis committed rOSCT6da2d43dc772: debian release 1.3.0-1 (authored by CDanis).
Dec 11 2019, 11:30 PM
CDanis added a comment to T240488: Network issues reaching phabricator on IPv6 (Comcast/Portland OR).

We also have two probes on Comcast's network constantly performing pings towards our RIPE Atlas anchor in ulsfo. Their network performance looks relatively stable over the past 24h: https://w.wiki/DiS

Dec 11 2019, 8:49 PM · Operations, netops
CDanis added a comment to T240488: Network issues reaching phabricator on IPv6 (Comcast/Portland OR).

I did some ICMP pings and TCP port 443 traceroutes from RIPE Atlas probes on Comcast's network that had IPv6 enabled. There were a few that couldn't reach Phabricator, including one in the Kirkland area (so probably reasonably close in network-space to @brion), but it's hard to be sure what this means -- there's always going to be some ambient background number of probes that are malfunctioning in some way. (And most of the probes that failed weren't on the west coast, but in the midwest or on the east coast, which will be a different part of Comcast's network than the one that shows as congested in your mtr.)

Dec 11 2019, 8:34 PM · Operations, netops
CDanis updated the task description for T240495: investigate making 'notrack' the default on our ferm rules.
Dec 11 2019, 7:34 PM · Operations
CDanis updated subscribers of T240495: investigate making 'notrack' the default on our ferm rules.
Dec 11 2019, 7:33 PM · Operations
CDanis created T240495: investigate making 'notrack' the default on our ferm rules.
Dec 11 2019, 7:30 PM · Operations
CDanis committed rOSCTa4d1479ecf3a: dbctl: generate externalLoads (authored by CDanis).
Dec 11 2019, 4:58 PM
CDanis created T240409: a few appservers at a time suffer mcrouter backlogs, leading to high latency.
Dec 11 2019, 12:48 AM · serviceops
CDanis created T240405: WikiPage::updateCategoryCounts causing replication lag due to long-running writes on commonswiki.
Dec 11 2019, 12:25 AM · Core Platform Team Workboards (Clinic Duty Team), Release-Engineering-Team, Wikimedia-Rdbms, Operations

Dec 9 2019

CDanis closed T239039: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey as Resolved.

Looking at some data in grafana explore, this would have solved most cases of noise in the past few months. So calling it resolved for now.

Dec 9 2019, 7:16 PM · Traffic, Operations, observability

Dec 6 2019

CDanis created T240048: Make grafana-next.wm.o HTTP 302 redirect to grafana.wm.o.
Dec 6 2019, 8:26 PM · observability, Operations
CDanis updated the title for P9834 kludge up a "number of seconds we saw the NIC close to saturation" exporter from untitled to kludge up a "number of seconds we saw the NIC close to saturation" exporter.
Dec 6 2019, 8:10 PM
CDanis created P9834 kludge up a "number of seconds we saw the NIC close to saturation" exporter.
Dec 6 2019, 8:04 PM

Dec 5 2019

CDanis created T239928: jerkins-bot should not post on IRC for Gerrit changes marked 'WIP'.
Dec 5 2019, 4:31 PM · Release-Engineering-Team (CI & Testing services), Jenkins, Release-Engineering-Team-TODO

Dec 4 2019

CDanis created T239862: unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet.
Dec 4 2019, 9:27 PM · Performance-Team (Radar), Operations
CDanis added a comment to P9809 another atlas_exporter prometheus-atlas-exporter crash (concurrent map iteration and map write).

another instance of P9808

Dec 4 2019, 9:11 PM
CDanis updated the title for P9809 another atlas_exporter prometheus-atlas-exporter crash (concurrent map iteration and map write) from Masterwork From Distant Lands to another atlas_exporter prometheus-atlas-exporter crash (concurrent map iteration and map write).
Dec 4 2019, 9:11 PM
CDanis edited P9809 another atlas_exporter prometheus-atlas-exporter crash (concurrent map iteration and map write).
Dec 4 2019, 9:10 PM
CDanis updated the title for P9808 atlas_exporter prometheus-atlas-exporter crash (concurrent map iteration and map write) from Masterwork From Distant Lands to atlas_exporter prometheus-atlas-exporter crash (concurrent map iteration and map write).
Dec 4 2019, 6:31 PM
CDanis edited P9808 atlas_exporter prometheus-atlas-exporter crash (concurrent map iteration and map write).
Dec 4 2019, 6:31 PM
CDanis edited P9807 Masterwork From Distant Lands.
Dec 4 2019, 3:57 PM

Dec 3 2019

CDanis added a comment to T229686: #dbctl: manage 'externalLoads' data.

Any rough ETA on when externalLoads will be able to be handled by dbctl?

Dec 3 2019, 4:51 PM · Performance-Team, DBA, conftool

Nov 27 2019

CDanis added a comment to T234450: Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException.

I've reapplied and deployed the "500 limit patch" from T234450#5595510 to wmf.5 as a (hopefully) very temporary measure while we continue to further troubleshoot this issue and tweak the PoolCounter solution.

Nov 27 2019, 10:14 PM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Patch-For-Review, User-notice, Core Platform Team Workboards (Clinic Duty Team), Vuln-DoS, Security, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error
CDanis created P9774 (An Untitled Masterwork).
Nov 27 2019, 8:19 PM
CDanis updated the task description for T239334: Python3 style guide.
Nov 27 2019, 5:23 PM · Patch-For-Review, User-ArielGlenn, User-jbond, Operations, Puppet

Nov 26 2019

CDanis renamed T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content) from 502 errors on ATS/8.0.5 to ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).
Nov 26 2019, 9:17 PM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
CDanis added a project to T239121: VE edit data stopped due to statsv falling over (?) on webperf1001: observability.
Nov 26 2019, 7:32 PM · Performance-Team (Radar), observability, Analytics, Editing-team
CDanis added a comment to T224888: Network port utilization alerts should be paging.

That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but can start with the existing ones.

Nov 26 2019, 4:13 PM · observability, Traffic, Operations, netops
CDanis added a comment to T224888: Network port utilization alerts should be paging.

Sounds great to me! I am assuming that on the Icinga side it'll be only one alert, at least to start with (e.g. for silencing purposes), which I think will work fine for now.

Nov 26 2019, 3:29 PM · observability, Traffic, Operations, netops

Nov 25 2019

Peter awarded T220838: Upgrade grafana to 6.4.4 a Yellow Medal token.
Nov 25 2019, 8:34 PM · Patch-For-Review, Performance-Team (Radar), observability, Operations
CDanis closed T220838: Upgrade grafana to 6.4.4 as Resolved.

Grafana 6.4.4 is now in use at https://grafana.wikimedia.org.

Nov 25 2019, 8:01 PM · Patch-For-Review, Performance-Team (Radar), observability, Operations
CDanis updated the title for P9740 confctl --quiet select 'service=nginx' get | jq 'select(..|.pooled? == "no") | select(.tags | contains("cluster=cache_"))' from Masterwork From Distant Lands to confctl --quiet select 'service=nginx' get | jq 'select(..|.pooled? == "no") | select(.tags | contains("cluster=cache_"))'.
Nov 25 2019, 7:30 PM
CDanis updated subscribers of T224888: Network port utilization alerts should be paging.

I've a proposal for doing this:

Nov 25 2019, 4:52 PM · observability, Traffic, Operations, netops
CDanis moved T239039: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey from Inbox to In progress on the observability board.
Nov 25 2019, 4:07 PM · Traffic, Operations, observability
CDanis updated the task description for T239039: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey.
Nov 25 2019, 1:09 AM · Traffic, Operations, observability
CDanis created T239039: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey.
Nov 25 2019, 1:08 AM · Traffic, Operations, observability

Nov 22 2019

CDanis added a comment to T238939: Increased latency in appservers - 22 Nov 2019.

At ~18:36 there was another spike in long-tail latency, but then, latency seemed to return to 'normal':
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1574443410296&to=1574450056679

Nov 22 2019, 7:30 PM · Traffic, Operations, serviceops
CDanis added a comment to T238833: Create NRPE check to alert when cergen certificates are due to expire.

Could we make the cergen script itself modify the permissions after it creates the files? It won't ensure things are right ongoing, but at least they'd be right from the start.

Nov 22 2019, 5:23 PM · Patch-For-Review, User-jbond, Puppet, Operations

Nov 21 2019

CDanis updated the task description for T220838: Upgrade grafana to 6.4.4.
Nov 21 2019, 6:33 PM · Patch-For-Review, Performance-Team (Radar), observability, Operations
CDanis added a comment to T220838: Upgrade grafana to 6.4.4.

I've heard no complaints, and can verify from the logs that it's seen at least some testing by others. Planning to do a final snapshot and move traffic over on Monday afternoon my time.

Nov 21 2019, 6:32 PM · Patch-For-Review, Performance-Team (Radar), observability, Operations

Nov 20 2019

CDanis added a comment to T234450: Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException.

Sure, I can create the configuration patch.

Nov 20 2019, 9:17 PM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Patch-For-Review, User-notice, Core Platform Team Workboards (Clinic Duty Team), Vuln-DoS, Security, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error
CDanis edited P9696 Masterwork From Distant Lands.
Nov 20 2019, 6:13 PM
CDanis added a comment to T238695: File on Commons not found: File:Nl-gegourmet.ogg.

I grepped through both the swiftrepl logs on ms-fe1005 and also the aggregated Swift mutation-operation logs on centrallog1001 and found no mention of the file.

Nov 20 2019, 2:59 PM · SRE-swift-storage, Operations, Commons
CDanis closed T238597: envoyproxy does not automatically reload certificates as Resolved.
Nov 20 2019, 2:41 PM · serviceops, Operations

Nov 19 2019

CDanis updated CDanis.
Nov 19 2019, 8:07 PM
CDanis added a comment to T234450: Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException.

Tim suggested 2 as a concurrency limit. I think we can start less conservative than that, though -- let's say 10? It feels pretty hard for that to hurt normal user traffic, while it should still prevent excessive usage.

Nov 19 2019, 8:01 PM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Patch-For-Review, User-notice, Core Platform Team Workboards (Clinic Duty Team), Vuln-DoS, Security, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error

Nov 18 2019

CDanis updated the task description for T220838: Upgrade grafana to 6.4.4.
Nov 18 2019, 11:17 PM · Patch-For-Review, Performance-Team (Radar), observability, Operations
CDanis added a comment to T220838: Upgrade grafana to 6.4.4.

upgraded the pie chart plugin to a recent version that actually works with 6.x:

❌cdanis@grafana1002.eqiad.wmnet ~ 🕕🍺 sudo http_proxy=http://webproxy.eqiad.wmnet:8080 grafana-cli plugins install grafana-piechart-panel
Nov 18 2019, 11:10 PM · Patch-For-Review, Performance-Team (Radar), observability, Operations
CDanis created T238597: envoyproxy does not automatically reload certificates.
Nov 18 2019, 9:55 PM · serviceops, Operations
CDanis renamed T220838: Upgrade grafana to 6.4.4 from Upgrade grafana to 6.x to Upgrade grafana to 6.4.4.
Nov 18 2019, 7:38 PM · Patch-For-Review, Performance-Team (Radar), observability, Operations

Nov 15 2019

CDanis committed rOSCT1e3e2d489460: dbctl: rename 'wikitech' to 's10' to match prod (authored by CDanis).
Nov 15 2019, 2:54 AM

Nov 14 2019

CDanis closed T238347: Add new SSH key for production access as Resolved.
Nov 14 2019, 5:08 PM · SRE-Access-Requests, Operations

Nov 12 2019

CDanis added a parent task for T211538: Report cpu seconds spent from MediaWiki to Graphite: Unknown Object (Task).
Nov 12 2019, 7:27 PM · Performance-Team

Nov 8 2019

CDanis added a comment to T234450: Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException.

+1 to @tstarling's proposal.

Nov 8 2019, 3:47 PM · MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Patch-For-Review, User-notice, Core Platform Team Workboards (Clinic Duty Team), Vuln-DoS, Security, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error

Nov 6 2019

CDanis claimed T237424: stunnel-wrap all rsync::server usage.

Generated a list of all $hosts_allow arguments from rsync::server::module invocations across all of Puppet: P9544

Nov 6 2019, 9:21 PM · Operations
CDanis updated the title for P9544 ✔️ cdanis@puppetdb1001.eqiad.wmnet ~/resources 🕓🍵 jq '.[] | select(.type == "Rsync::Server::Module") | {"node": .certname, "title": .title, "hosts_allow":.parameters.hosts_allow}' * | phaste from Masterwork From Distant Lands to ✔️ cdanis@puppetdb1001.eqiad.wmnet ~/resources 🕓🍵 jq '.[] | select(.type == "Rsync::Server::Module") | {"node": .certname, "title": .title, "hosts_allow":.parameters.hosts_allow}' * | phaste.
Nov 6 2019, 8:48 PM
CDanis added a comment to T237424: stunnel-wrap all rsync::server usage.

Another thing that just came up: not all users of rsync::server::module are actually passing an array to the $hosts_allow argument: https://gerrit.wikimedia.org/r/c/operations/puppet/+/549142
Need to go through PuppetDB and look for this.

Nov 6 2019, 6:39 PM · Operations