
CDanis (Chris Danis)
SRE @ WMF

User Details

User Since
Nov 5 2018, 2:54 PM (92 w, 4 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF)

Recent Activity

Yesterday

CDanis created T260452: clean up workaround and measurements put in place during Jio RPKI error.
Fri, Aug 14, 5:10 PM · Patch-For-Review, Operations, netops, Traffic
CDanis closed T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites as Resolved.

There's still an issue on Jio's side that they need to fix, but we've put a temporary workaround in place, and their users should be able to access Wikipedia and other WMF sites. Please let us know if that isn't the case!

Fri, Aug 14, 5:02 PM · Operations, netops, Traffic
CDanis closed Restricted Task, a subtask of T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites, as Resolved.
Fri, Aug 14, 5:02 PM · Operations, netops, Traffic
CDanis updated the task description for T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.
Fri, Aug 14, 4:52 PM · Operations, netops, Traffic
CDanis added a comment to T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.

For posterity, relevant workaround patch and deployment thereof: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/620377
https://sal.toolforge.org/production?p=0&q=I9fcff8&d=

Fri, Aug 14, 4:51 PM · Operations, netops, Traffic
CDanis added a subtask for T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites: Unknown Object (Task).
Fri, Aug 14, 4:50 PM · Operations, netops, Traffic
CDanis created T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.
Fri, Aug 14, 4:50 PM · Operations, netops, Traffic

Thu, Aug 13

CDanis added a comment to T260281: mw* servers memory leaks (12 Aug).

A thing that someone daring in EUTZ might want to try: Using perf probe, or by modifying the bpfcc-memleak script, or by writing a trivial bpftrace script: attach a tracepoint to memcg_schedule_kmem_cache_create and gather calling stacktraces. That's the function that creates the work item that results in a worker thread calling memcg_create_kmem_cache, as seen in the stack traces we saw for 32-byte mallocs.

Thu, Aug 13, 1:29 AM · Wikidata, Platform Engineering, Sustainability (Incident Followup), Operations, serviceops

Wed, Aug 12

CDanis added a comment to T260281: mw* servers memory leaks (12 Aug).

There's some preliminary results from tracing 4096-byte allocations at

[21:27:29] Top 10 stacks with outstanding allocations:
  2416640 bytes in 590 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_current+0x91 [kernel]
    pagecache_get_page+0xb4 [kernel]
    grab_cache_page_write_begin+0x1c [kernel]
    ext4_da_write_begin+0x9f [ext4]
    ext4_da_write_end+0x142 [ext4]
    generic_perform_write+0xcf [kernel]
    __generic_file_write_iter+0x16a [kernel]
    ext4_file_write_iter+0x90 [ext4]
    mem_cgroup_try_charge+0x65 [kernel]
    new_sync_write+0xe0 [kernel]
    vfs_write+0xb0 [kernel]
    sys_write+0x5a [kernel]
    do_syscall_64+0x8d [kernel]
    entry_SYSCALL_64_after_swapgs+0x58 [kernel]
  2609152 bytes in 637 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_vma+0xaa [kernel]
    handle_mm_fault+0x10ff [kernel]
    new_sync_read+0xdd [kernel]
    __do_page_fault+0x255 [kernel]
    page_fault+0x28 [kernel]
  3035136 bytes in 741 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_current+0x91 [kernel]
    pagecache_get_page+0xb4 [kernel]
    grab_cache_page_write_begin+0x1c [kernel]
    ext4_da_write_begin+0x9f [ext4]
    ext4_da_write_end+0x142 [ext4]
    generic_perform_write+0xcf [kernel]
    __generic_file_write_iter+0x16a [kernel]
    ext4_file_write_iter+0x90 [ext4]
    perf_trace_run_bpf_submit+0x4a [kernel]
    new_sync_write+0xe0 [kernel]
    vfs_write+0xb0 [kernel]
    sys_write+0x5a [kernel]
    do_syscall_64+0x8d [kernel]
    entry_SYSCALL_64_after_swapgs+0x58 [kernel]
  3244032 bytes in 792 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    pcpu_populate_chunk+0xbd [kernel]
    pcpu_balance_workfn+0x59a [kernel]
    __switch_to_asm+0x41 [kernel]
    __switch_to_asm+0x35 [kernel]
    process_one_work+0x18a [kernel]
    worker_thread+0x4d [kernel]
    worker_thread+0x0 [kernel]
    kthread+0xd9 [kernel]
    kthread+0x0 [kernel]
    ret_from_fork+0x57 [kernel]
  4063232 bytes in 992 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_vma+0xaa [kernel]
    handle_mm_fault+0x10ff [kernel]
    do_nanosleep+0x8d [kernel]
    __do_page_fault+0x255 [kernel]
    page_fault+0x28 [kernel]
  9363456 bytes in 2286 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_vma+0xaa [kernel]
    handle_mm_fault+0x10ff [kernel]
    __do_page_fault+0x255 [kernel]
    sys_brk+0x160 [kernel]
    page_fault+0x28 [kernel]
  21508096 bytes in 5251 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_current+0x91 [kernel]
    pagecache_get_page+0xb4 [kernel]
    grab_cache_page_write_begin+0x1c [kernel]
    ext4_da_write_begin+0x9f [ext4]
    ext4_da_write_end+0x142 [ext4]
    generic_perform_write+0xcf [kernel]
    __generic_file_write_iter+0x16a [kernel]
    ext4_file_write_iter+0x90 [ext4]
    do_readv_writev+0x151 [kernel]
    new_sync_write+0xe0 [kernel]
    vfs_write+0xb0 [kernel]
    sys_write+0x5a [kernel]
    do_syscall_64+0x8d [kernel]
    entry_SYSCALL_64_after_swapgs+0x58 [kernel]
  47345664 bytes in 11559 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_vma+0xaa [kernel]
    handle_mm_fault+0x10ff [kernel]
    __do_page_fault+0x255 [kernel]
    page_fault+0x28 [kernel]
  59744256 bytes in 14586 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_current+0x91 [kernel]
    pagecache_get_page+0xb4 [kernel]
    grab_cache_page_write_begin+0x1c [kernel]
    ext4_da_write_begin+0x9f [ext4]
    ext4_da_write_end+0x142 [ext4]
    generic_perform_write+0xcf [kernel]
    __generic_file_write_iter+0x16a [kernel]
    ext4_file_write_iter+0x90 [ext4]
    new_sync_write+0xe0 [kernel]
    vfs_write+0xb0 [kernel]
    sys_write+0x5a [kernel]
    do_syscall_64+0x8d [kernel]
    entry_SYSCALL_64_after_swapgs+0x58 [kernel]
  61636608 bytes in 15048 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    pcpu_populate_chunk+0xbd [kernel]
    pcpu_balance_workfn+0x59a [kernel]
    __switch_to_asm+0x41 [kernel]
    __switch_to_asm+0x35 [kernel]
    process_one_work+0x18a [kernel]
    worker_thread+0x4d [kernel]
    worker_thread+0x0 [kernel]
    kthread+0xd9 [kernel]
    __switch_to_asm+0x41 [kernel]
    kthread+0x0 [kernel]
    ret_from_fork+0x57 [kernel]
although some of the stack traces make me think I need to make sure the script is tracing only the kmalloc codepaths.
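
One way to check that suspicion is to group the reported stacks by where the pages came from. The following is a rough sketch, not part of the original investigation: it parses memleak-style output and sums outstanding bytes per origin. The marker frames and their labels (`MARKERS`) and all function names are my own assumptions for illustration; the sample text is two of the stacks from the output above.

```python
import re
from collections import Counter
from typing import Dict, List, NamedTuple


class Stack(NamedTuple):
    total_bytes: int
    allocations: int
    frames: List[str]  # function names in printed order, offsets stripped


HEADER = re.compile(r"^\s*(\d+) bytes in (\d+) allocations from stack\s*$")
FRAME = re.compile(r"^\s*(\S+)\+0x[0-9a-f]+ \[[^\]]+\]\s*$")

# Marker substrings and the labels attached to them. These labels are an
# assumption (one reading of the traces), not something memleak reports.
MARKERS = [
    ("kmalloc", "slab (kmalloc)"),
    ("kmem_cache", "slab (kmem_cache)"),
    ("pcpu_populate_chunk", "percpu allocator"),
    ("pagecache_get_page", "page cache write"),
    ("handle_mm_fault", "page fault"),
]


def parse_memleak(text: str) -> List[Stack]:
    """Turn 'N bytes in M allocations from stack' sections into Stack records."""
    stacks: List[Stack] = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            stacks.append(Stack(int(header.group(1)), int(header.group(2)), []))
            continue
        frame = FRAME.match(line)
        if frame and stacks:
            stacks[-1].frames.append(frame.group(1))
    return stacks


def classify(stack: Stack) -> str:
    """Return the label of the first marker found in any frame, else 'other'."""
    for needle, label in MARKERS:
        if any(needle in frame for frame in stack.frames):
            return label
    return "other"


def bytes_by_origin(stacks: List[Stack]) -> Dict[str, int]:
    """Sum outstanding bytes per origin label."""
    totals: Counter = Counter()
    for stack in stacks:
        totals[classify(stack)] += stack.total_bytes
    return dict(totals)


SAMPLE = """\
[21:27:29] Top 10 stacks with outstanding allocations:
  47345664 bytes in 11559 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    alloc_pages_vma+0xaa [kernel]
    handle_mm_fault+0x10ff [kernel]
    __do_page_fault+0x255 [kernel]
    page_fault+0x28 [kernel]
  61636608 bytes in 15048 allocations from stack
    __alloc_pages_nodemask+0x1c2 [kernel]
    __alloc_pages_nodemask+0x1c2 [kernel]
    pcpu_populate_chunk+0xbd [kernel]
    pcpu_balance_workfn+0x59a [kernel]
    __switch_to_asm+0x41 [kernel]
    __switch_to_asm+0x35 [kernel]
    process_one_work+0x18a [kernel]
    worker_thread+0x4d [kernel]
    worker_thread+0x0 [kernel]
    kthread+0xd9 [kernel]
    __switch_to_asm+0x41 [kernel]
    kthread+0x0 [kernel]
    ret_from_fork+0x57 [kernel]
"""

if __name__ == "__main__":
    for origin, total in bytes_by_origin(parse_memleak(SAMPLE)).items():
        print(f"{total:>12} bytes  {origin}")
```

Neither sample stack lands in a slab bucket, which fits the concern that this trace is capturing page-level allocations rather than kmalloc codepaths.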

Wed, Aug 12, 9:30 PM · Wikidata, Platform Engineering, Sustainability (Incident Followup), Operations, serviceops
CDanis added a comment to T260281: mw* servers memory leaks (12 Aug).

Here's a summary of my findings so far, although keep in mind that I'm well out of my depth here and this is an amalgamation of guesswork.

Wed, Aug 12, 9:19 PM · Wikidata, Platform Engineering, Sustainability (Incident Followup), Operations, serviceops
CDanis updated the title for P12237 ~5hours of `memleak-bpfcc -o 60000 -z 31 -Z 33 30` on mw1359 from Masterwork From Distant Lands to ~5hours of `memleak-bpfcc -o 60000 -z 31 -Z 33 30` on mw1359.
Wed, Aug 12, 9:17 PM
CDanis updated the title for P12234 ✔️ cdanis@mw1359.eqiad.wmnet ~ 🕔🍵 (uptime; sudo slabtop -o )| phaste from Masterwork From Distant Lands to ✔️ cdanis@mw1359.eqiad.wmnet ~ 🕔🍵 (uptime; sudo slabtop -o )| phaste.
Wed, Aug 12, 8:59 PM
CDanis added a comment to P12232 ✔️ cdanis@mw1322.eqiad.wmnet ~ 🕔🍵 sudo slabtop -o | phaste.

sampled ~5 minutes after a fresh reboot

Wed, Aug 12, 8:58 PM
CDanis updated the title for P12232 ✔️ cdanis@mw1322.eqiad.wmnet ~ 🕔🍵 sudo slabtop -o | phaste from Masterwork From Distant Lands to ✔️ cdanis@mw1322.eqiad.wmnet ~ 🕔🍵 sudo slabtop -o | phaste.
Wed, Aug 12, 8:58 PM
CDanis updated the title for P12231 ✔️ cdanis@mw1357.eqiad.wmnet ~ 🕔🍵 sudo slabtop -o | phaste from Masterwork From Distant Lands to ✔️ cdanis@mw1357.eqiad.wmnet ~ 🕔🍵 sudo slabtop -o | phaste.
Wed, Aug 12, 8:57 PM
CDanis created T260241: Grafana/Thanos serves 503s for long-time-window requests.
Wed, Aug 12, 1:51 PM · Operations, observability

Tue, Aug 11

CDanis added a comment to T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled).

The Envoy TLS terminator is now configured to allow websocket upgrades -- however, it's improperly configured. It's trying to connect via IPv6 address to the nodejs process, but the nodejs process is not listening there -- it listens only on the IPv4 address.
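
The failure mode is easy to reproduce locally. This is a minimal sketch, assuming nothing about the real Envoy or aphlict configuration: a listener bound only to the IPv4 loopback accepts connections on 127.0.0.1 but not on ::1. The helper names `can_connect` and `demo` are made up for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and hosts without IPv6.
        return False


def demo() -> dict:
    """Listen on the IPv4 loopback only, then probe both address families."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen()
        port = listener.getsockname()[1]
        return {
            "ipv4": can_connect("127.0.0.1", port),
            "ipv6": can_connect("::1", port),
        }


if __name__ == "__main__":
    print(demo())
```

Either the backend has to listen on both families or the proxy has to be pointed at the IPv4 address explicitly.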

Tue, Aug 11, 1:48 PM · Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), Patch-For-Review, Phabricator, serviceops, Operations, Traffic

Wed, Jul 29

CDanis renamed T257527: automatically collect network error reports from users' browsers (Network Error Logging API) from automatically collect network error reports from users' browsers to automatically collect network error reports from users' browsers (Network Error Logging API).
Wed, Jul 29, 5:10 PM · Operations, Goal, Epic
CDanis closed T258648: monitoring for mismatched LVS realserver addresses/configurations, a subtask of T258614: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping, as Declined.
Wed, Jul 29, 3:56 PM · Operations, observability, serviceops, Traffic
CDanis closed T258648: monitoring for mismatched LVS realserver addresses/configurations as Declined.

The structure of the data makes this teeth-pullingly impossibly difficult to do well, and the current production configuration has an incredible breadth *and* depth of mismatches between the two ideas of the world held by profile::lvs::realserver::$pools vs conftool.

Wed, Jul 29, 3:56 PM · Patch-For-Review, Operations, observability, serviceops, Traffic
CDanis updated the title for P12124 profile::lvs::realserver::$pools (left side) vs conftool's node-to-service mapping (right side) from Masterwork From Distant Lands to profile::lvs::realserver::$pools (left side) vs conftool's node-to-service mapping (right side).
Wed, Jul 29, 3:56 PM
CDanis edited P12125 check_realserver_vs_conftool.sh.
Wed, Jul 29, 3:47 PM
CDanis edited P12125 check_realserver_vs_conftool.sh.
Wed, Jul 29, 3:44 PM
CDanis edited P12124 profile::lvs::realserver::$pools (left side) vs conftool's node-to-service mapping (right side).
Wed, Jul 29, 3:44 PM
CDanis edited P12124 profile::lvs::realserver::$pools (left side) vs conftool's node-to-service mapping (right side).
Wed, Jul 29, 3:40 PM
CDanis updated the title for P12125 check_realserver_vs_conftool.sh from Masterwork From Distant Lands to check_realserver_vs_conftool.sh.
Wed, Jul 29, 3:39 PM
CDanis edited P12125 check_realserver_vs_conftool.sh.
Wed, Jul 29, 3:32 PM

Tue, Jul 28

CDanis closed T258018: ripe-atlas-eqiad IPv6 unreachable as Resolved.

With the serial console now attached, I found myself in a rescue shell.

Tue, Jul 28, 1:10 PM · Operations, netops

Mon, Jul 27

CDanis closed T234450: Special:Contributions requests with a high &limit= caused excessive database load as Resolved.

We haven't seen a recurrence of this so I am optimistically resolving.

Mon, Jul 27, 12:57 PM · MW-1.31-release-notes, MW-1.33-notes, MW-1.34-notes, Platform Engineering, Security, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), User-notice, Vuln-DoS, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error
CDanis added a comment to T257002: Special:Contributions fails to load contributions with relatively small limit for high-volume users.

This was not a duplicate, and the issue still exists. T234450 had a misleading title; it isn't about any WMFTimeoutException on Special:Contributions, but a particular special kind from months ago that was creating widespread database issues, because of excessive traffic from a scraper.

Mon, Jul 27, 12:57 PM · MediaWiki-Special-pages, Wikimedia-production-error, Wikidata
CDanis renamed T234450: Special:Contributions requests with a high &limit= caused excessive database load from Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException to Special:Contributions requests with a high &limit= caused excessive database load.
Mon, Jul 27, 12:56 PM · MW-1.31-release-notes, MW-1.33-notes, MW-1.34-notes, Platform Engineering, Security, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), User-notice, Vuln-DoS, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error
CDanis reopened T257002: Special:Contributions fails to load contributions with relatively small limit for high-volume users as "Open".
Mon, Jul 27, 12:55 PM · MediaWiki-Special-pages, Wikimedia-production-error, Wikidata

Sat, Jul 25

CDanis added a project to T257879: Drop official PHP 7.2 support in MediaWiki 1.35: serviceops.
Sat, Jul 25, 12:06 PM · Patch-For-Review, Release-Engineering-Team, serviceops, MediaWiki-Stakeholders-Group, Proposal, MediaWiki-General, PHP 7.2 support, MW-1.35-release

Thu, Jul 23

Aklapper awarded T151977: Open Redirect in secure.wikimedia.org a Like token.
Thu, Jul 23, 5:17 PM · serviceops, Security-Team, Traffic, Security, Patch-For-Review, Wikimedia-Apache-configuration, Operations
CDanis closed T151977: Open Redirect in secure.wikimedia.org as Resolved.

Re-validation forced for ATS-BE, and also a Varnish cache ban has been put in place, so we should no longer be serving any cached 301s.

Thu, Jul 23, 5:16 PM · serviceops, Security-Team, Traffic, Security, Patch-For-Review, Wikimedia-Apache-configuration, Operations
CDanis added a comment to T151977: Open Redirect in secure.wikimedia.org.

I applied @Platonides's patch on mwdebug1002, tested it using httpbb, fixed a small error (character class was negated when it shouldn't've been), and it's now merged and being deployed.

Thu, Jul 23, 4:22 PM · serviceops, Security-Team, Traffic, Security, Patch-For-Review, Wikimedia-Apache-configuration, Operations
CDanis added a project to T255981: Persistant error 500 getting category members: Platform Team Workboards (Clinic Duty Team).

This issue has been ongoing for a while and likely merits some CPT attention.

Thu, Jul 23, 1:52 PM · Platform Team Workboards (Clinic Duty Team), Release-Engineering-Team, Upstream, Commons, ApiFeatureUsage, Pywikibot
CDanis created T258697: stop using $::site in description field of service.yaml.
Thu, Jul 23, 11:55 AM · Operations

Wed, Jul 22

CDanis edited P12023 Masterwork From Distant Lands.
Wed, Jul 22, 8:22 PM
CDanis created T258648: monitoring for mismatched LVS realserver addresses/configurations.
Wed, Jul 22, 8:07 PM · Patch-For-Review, Operations, observability, serviceops, Traffic
CDanis added a comment to T258614: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping.

First occurrence was June 17th, 15:10 UTC:

Wed, Jul 22, 4:48 PM · Operations, observability, serviceops, Traffic
CDanis updated subscribers of T258221: please connect eqiad's RIPE Atlas anchor to one of the SCSes.

Do you think you'd have time to attempt this in the next few days? The eqiad anchor being down has some impact on our monitoring, and we have a RIPE engineer waiting on us to try rebooting the anchor.

Wed, Jul 22, 12:47 PM · Operations, ops-eqiad

Tue, Jul 21

CDanis edited P11996 Masterwork From Distant Lands.
Tue, Jul 21, 3:07 PM
CDanis added a comment to T258221: please connect eqiad's RIPE Atlas anchor to one of the SCSes.

Hi @Cmjohnson -- any idea of when you might be able to get around to this? Thanks!

Tue, Jul 21, 12:21 PM · Operations, ops-eqiad

Mon, Jul 20

CDanis closed T256412: Production shell access for Chris Albon as Resolved.

Email with temporary Kerberos password sent.

Mon, Jul 20, 5:44 PM · Patch-For-Review, Release-Engineering-Team, Operations, Machine Learning Platform (Current)
CDanis added a project to T258414: Cassandra Grafana dashboards seem to disagree with actual utilization: observability.
Mon, Jul 20, 5:21 PM · Platform Team Workboards (Clinic Duty Team), observability
CDanis closed T212971: gather some stats about machine utilization: cpu/ram/iops/network? as Declined.
Mon, Jul 20, 3:29 PM · User-CDanis
CDanis closed T205396: Evaluate/integrate rasdaemon as a replacement for mcelog as Resolved.
Mon, Jul 20, 3:26 PM · Patch-For-Review, observability, Operations
CDanis closed T244761: Script to point SRE local machine traffic to another LB as Resolved.
Mon, Jul 20, 3:24 PM · Operations

Sun, Jul 19

CDanis triaged T258360: db1085 crashed as High priority.
Sun, Jul 19, 6:47 PM · DBA, Operations
CDanis created T258360: db1085 crashed.
Sun, Jul 19, 6:46 PM · DBA, Operations

Sat, Jul 18

CDanis triaged T258336: db1082 crashed as High priority.
Sat, Jul 18, 8:38 PM · Operations, DBA
CDanis created T258336: db1082 crashed.
Sat, Jul 18, 8:38 PM · Operations, DBA

Fri, Jul 17

CDanis closed T258284: Verify WikimediaFoundation.org ownership for Facebook as Resolved.
✔️ cdanis@authdns1001.wikimedia.org ~ 🕧☕ host -t txt wikimediafoundation.org
wikimediafoundation.org descriptive text "facebook-domain-verification=orjhfudqyeumf4uber3bpwan5pisu0"
Fri, Jul 17, 4:34 PM · Operations, DNS, Traffic
CDanis claimed T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Fri, Jul 17, 3:46 PM · Operations, Goal, Epic

Thu, Jul 16

CDanis added a comment to T256971: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan.

Ah -- one last thing -- you should have an email in your inbox with a temporary Kerberos password. Please follow the instructions in it and set your own password there soon.

Thu, Jul 16, 8:18 PM · Patch-For-Review, Trust-and-Safety, Operations, SRE-Access-Requests
CDanis updated the task description for T258221: please connect eqiad's RIPE Atlas anchor to one of the SCSes.
Thu, Jul 16, 7:54 PM · Operations, ops-eqiad
CDanis changed the status of T258018: ripe-atlas-eqiad IPv6 unreachable from Open to Stalled.
Thu, Jul 16, 7:54 PM · Operations, netops
CDanis added a parent task for T258221: please connect eqiad's RIPE Atlas anchor to one of the SCSes: T258018: ripe-atlas-eqiad IPv6 unreachable.
Thu, Jul 16, 7:53 PM · Operations, ops-eqiad
CDanis added a subtask for T258018: ripe-atlas-eqiad IPv6 unreachable: T258221: please connect eqiad's RIPE Atlas anchor to one of the SCSes.
Thu, Jul 16, 7:53 PM · Operations, netops
CDanis created T258221: please connect eqiad's RIPE Atlas anchor to one of the SCSes.
Thu, Jul 16, 7:50 PM · Operations, ops-eqiad
CDanis added a comment to T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran.

Thanks! Will go through early next week, per the waiting period

Thu, Jul 16, 7:37 PM · SRE-Access-Requests, Operations
CDanis placed T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran up for grabs.
Thu, Jul 16, 7:36 PM · SRE-Access-Requests, Operations
CDanis assigned T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran to Nuria.
Thu, Jul 16, 7:34 PM · SRE-Access-Requests, Operations
CDanis assigned T256412: Production shell access for Chris Albon to calbon.

Are you still having trouble accessing the stat machines?

Thu, Jul 16, 7:34 PM · Patch-For-Review, Release-Engineering-Team, Operations, Machine Learning Platform (Current)
CDanis assigned T257507: HTTP 500 error trying to access phabricator from my server to Isarra.
Thu, Jul 16, 7:19 PM · Security-Team, Operations, Phabricator
CDanis added a comment to T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran.

We need an expiration date for this collaboration.

Thu, Jul 16, 7:15 PM · SRE-Access-Requests, Operations
CDanis added a comment to T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran.

Two things pending:

  • @Nuria's approval for analytics access
  • three business day waiting period
Thu, Jul 16, 7:12 PM · SRE-Access-Requests, Operations
CDanis updated the task description for T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran.
Thu, Jul 16, 7:11 PM · SRE-Access-Requests, Operations
CDanis updated subscribers of T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran.

Thanks, I've verified that NDA is on file.

Thu, Jul 16, 7:07 PM · SRE-Access-Requests, Operations
CDanis closed T256971: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan as Resolved.

Thanks Nahid!

Thu, Jul 16, 6:37 PM · Patch-For-Review, Trust-and-Safety, Operations, SRE-Access-Requests
CDanis closed T258119: Requesting access to deployment for jlinehan as Resolved.

Done; should be live across the fleet within 30 minutes.

Thu, Jul 16, 6:19 PM · Operations, SRE-Access-Requests
CDanis updated the task description for T258119: Requesting access to deployment for jlinehan.
Thu, Jul 16, 6:15 PM · Operations, SRE-Access-Requests
CDanis reassigned T256971: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan from CDanis to Nahid.

Hi Nahid,

Thu, Jul 16, 5:50 PM · Patch-For-Review, Trust-and-Safety, Operations, SRE-Access-Requests
CDanis claimed T258018: ripe-atlas-eqiad IPv6 unreachable.
Thu, Jul 16, 3:45 PM · Operations, netops
CDanis claimed T256971: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan.

thanks! will do so today

Thu, Jul 16, 3:00 PM · Patch-For-Review, Trust-and-Safety, Operations, SRE-Access-Requests
CDanis closed T248097: Hive access for Sam Patton as Resolved.

@spatton I'm going to optimistically close this assuming that Turnilo access has been sufficient for you, please do reopen if it's not!

Thu, Jul 16, 12:44 PM · Analytics-Radar, SRE-Access-Requests, Operations, Product-Analytics
CDanis added a comment to T256971: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan.

@Nuria ping :)

Thu, Jul 16, 12:43 PM · Patch-For-Review, Trust-and-Safety, Operations, SRE-Access-Requests
CDanis added a comment to T256435: Requesting access to analytics-privatedata-users and nda groups for edtadros.

@Jrbranaa ping? this access request is waiting on your reply re: contract end date

Thu, Jul 16, 12:42 PM · Operations, SRE-Access-Requests
CDanis claimed T258119: Requesting access to deployment for jlinehan.
Thu, Jul 16, 12:38 PM · Operations, SRE-Access-Requests

Jul 15 2020

CDanis added a comment to T257905: Spin off common Spicerack modules into a standalone Python library importable anywhere.

I think it would be nice if there was a small helper function to set User-Agent headers per our policy.
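
Something along these lines, as a rough sketch only. The names `build_user_agent` and `make_request` are hypothetical and are not Spicerack APIs, and the exact string format is an assumption; the point is just that every request identifies the tool and a way to contact its owner.

```python
import urllib.request


def build_user_agent(tool: str, version: str, contact: str) -> str:
    """Build a descriptive User-Agent string (format is an assumption)."""
    if not tool or not contact:
        raise ValueError("both a tool name and contact information are required")
    return f"{tool}/{version} ({contact})"


def make_request(url: str, tool: str, version: str, contact: str) -> urllib.request.Request:
    """Return a Request that already carries the descriptive User-Agent."""
    headers = {"User-Agent": build_user_agent(tool, version, contact)}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = make_request("https://example.org/", "my-sre-tool", "0.1", "sre@example.org")
    # urllib normalises header names, so the key is stored as "User-agent".
    print(req.get_header("User-agent"))
```

Centralising this in one helper means callers cannot forget the header or invent their own format.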

Jul 15 2020, 6:43 PM · Goal, SRE-tools

Jul 14 2020

CDanis added a comment to T257527: automatically collect network error reports from users' browsers (Network Error Logging API).

Determine whether or not we want additional stream processing to split apart NEL responses into their component events, as each POST made to the reporting endpoint is potentially a batch of multiple error reports.

Oo, interesting, how are they batched? If they are POSTed as an array of the same NEL object that all conform to the same schema, EventGate will treat them as individual events and post them as individual messages to Kafka.
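
For illustration, here is a small sketch of what splitting a batch could look like if it were done outside EventGate. It assumes the POST body is a JSON array of report objects, each with a `type` field and a nested `body`; the function name `split_nel_batch` and the sample values are made up.

```python
import json
from typing import Any, Dict, List


def split_nel_batch(raw_body: bytes) -> List[Dict[str, Any]]:
    """Split one reporting POST into individual network-error reports."""
    payload = json.loads(raw_body)
    if isinstance(payload, dict):
        payload = [payload]  # tolerate a single unbatched report
    if not isinstance(payload, list):
        raise ValueError("expected a JSON array of reports")
    # Keep only NEL reports; other report types may share the endpoint.
    return [
        report
        for report in payload
        if isinstance(report, dict) and report.get("type") == "network-error"
    ]


# Made-up sample batch: two NEL reports and one unrelated report.
SAMPLE_BATCH = json.dumps([
    {"age": 10, "type": "network-error", "url": "https://example.org/a",
     "body": {"type": "tcp.timed_out", "phase": "connection"}},
    {"age": 25, "type": "network-error", "url": "https://example.org/b",
     "body": {"type": "dns.name_not_resolved", "phase": "dns"}},
    {"age": 3, "type": "deprecation", "url": "https://example.org/c", "body": {}},
]).encode()

if __name__ == "__main__":
    for event in split_nel_batch(SAMPLE_BATCH):
        print(event["url"], event["body"]["type"])
```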

Jul 14 2020, 2:28 AM · Operations, Goal, Epic

Jul 12 2020

CDanis added a comment to T252630: LibreNMS monitoring glitch caused paging.

This happened again just now.

Jul 12 2020, 2:04 AM · Operations, observability, netops

Jul 9 2020

CDanis updated subscribers of T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Jul 9 2020, 12:16 PM · Operations, Goal, Epic

Jul 8 2020

CDanis merged T207860: Collect client network errors, deprecation, intervention and crash reports into T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Jul 8 2020, 11:28 PM · Operations, Goal, Epic
CDanis merged task T207860: Collect client network errors, deprecation, intervention and crash reports into T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Jul 8 2020, 11:28 PM · observability, Traffic, Operations
CDanis created T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Jul 8 2020, 10:48 PM · Operations, Goal, Epic
CDanis updated subscribers of T257507: HTTP 500 error trying to access phabricator from my server.
Jul 8 2020, 7:55 PM · Security-Team, Operations, Phabricator
CDanis added a comment to T253673: Avoid php-opcache corruption in WMF production.

FTR i-->h is a single bit-flip in the LSB.
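
A quick way to verify this kind of claim is to XOR the two character codes and list the set bits. The helper name `flipped_bits` is just for illustration.

```python
def flipped_bits(a: str, b: str) -> list:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = ord(a) ^ ord(b)
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


if __name__ == "__main__":
    print(format(ord("i"), "08b"))  # 01101001
    print(format(ord("h"), "08b"))  # 01101000
    print(flipped_bits("i", "h"))   # [0]
```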

Jul 8 2020, 6:23 PM · Sustainability (Incident Followup), Performance-Team, serviceops
CDanis added a comment to T253673: Avoid php-opcache corruption in WMF production.

FTR i-->h is a single bit-flip in the LSB.

Jul 8 2020, 6:09 PM · Sustainability (Incident Followup), Performance-Team, serviceops
CDanis added a comment to T257308: Upload git 2.20 package from stretch-backports to component/git.

Is there a measurable performance benefit of protocol-2 over the default git? We can do that, but it comes at a cost as we need to backport future git vulnerabilities to that version.

Jul 8 2020, 12:30 PM · Operations
CDanis placed T257392: automatically sample from all FPCs on core routers up for grabs.
Jul 8 2020, 12:48 AM · Operations, netops
CDanis triaged T257392: automatically sample from all FPCs on core routers as Low priority.
Jul 8 2020, 12:48 AM · Operations, netops
CDanis added a parent task for T257392: automatically sample from all FPCs on core routers: Unknown Object (Task).
Jul 8 2020, 12:47 AM · Operations, netops
CDanis created T257392: automatically sample from all FPCs on core routers.
Jul 8 2020, 12:47 AM · Operations, netops

Jul 7 2020

CDanis edited P11793 Masterwork From Distant Lands.
Jul 7 2020, 5:21 PM

Jul 6 2020

CDanis created T257219: serve our production ssh known_hosts file over public HTTPS.
Jul 6 2020, 2:59 PM · Operations

Jul 5 2020

CDanis triaged T257128: every Sunday at 00:00 UTC, logrotate fails on netflow hosts as Low priority.
Jul 5 2020, 12:20 AM · netops, Operations
CDanis created T257128: every Sunday at 00:00 UTC, logrotate fails on netflow hosts.
Jul 5 2020, 12:20 AM · netops, Operations

Jul 3 2020

CDanis added a comment to T257066: Extension:Score / Lilypond is disabled on all wikis.

To be more precise, the error is:

Could not execute LilyPond: /dev/null is not an executable file. Make sure $wgScoreLilyPond is set correctly.

This could be a puppet change which changed the firejail command to something that doesn't work anymore.

Jul 3 2020, 4:56 PM · User-notice, Patch-For-Review, Security-Team, Security, Wikimedia-General-or-Unknown, MediaWiki-extensions-Score, Operations