Feb 20 2024

mark added a comment to T357847: Requesting access to analytics-privatedata-users for sdeckelmann-wmf.

This is approved.

Feb 20 2024, 11:57 AM · SRE, SRE-Access-Requests

Sep 8 2023

mark added a comment to T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper.

Approved.

Sep 8 2023, 11:16 AM · SRE-Access-Requests, SRE

Sep 7 2023

mark added a comment to T344509: Security Issue Access Request for (Kappakayala).

Approved.

Sep 7 2023, 10:47 AM · SecTeam-Processed, Security-Team, Security

Apr 19 2023

mark updated subscribers of T315426: Audit abuse filter wikireplica view rules.

@BTullis @odimitrijevic Given that this is an ongoing privacy leak, could we get some clarity on whether we can get this deployed soon, or how other teams may be able to help if needed?

Apr 19 2023, 11:39 AM · Data-Platform-SRE, Data-Engineering-Planning, SecTeam-Processed, Vuln-Infoleak, Data-Services, Security, Security-Team

Jul 18 2022

mark moved T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself from Inbox to Backlog on the SRE-swift-storage board.
Jul 18 2022, 12:13 PM · Patch-For-Review, SRE-swift-storage

Jul 11 2022

mark moved T256217: Swift sends ETAG without double-quotes from Inbox to Backlog on the SRE-swift-storage board.
Jul 11 2022, 1:35 PM · Wikimedia-Performance-recommendation, Traffic-Icebox, SRE-swift-storage, SRE, affects-Kiwix-and-openZIM
mark added a comment to T256217: Swift sends ETAG without double-quotes.

This appears to be configurable now in Swift 2.24.0 and later (we currently seem to be running 2.26.0 on 6/8 of frontends...), by enabling a piece of middleware and configuring RFC compliant ETag responses for specific Swift user accounts or containers:
https://docs.openstack.org/swift/latest/middleware.html#module-swift.common.middleware.etag_quoter
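
For illustration, the difference between the unquoted ETag Swift sends today and the quoted form expected by the HTTP RFCs can be sketched as below. This is not the middleware's code; `quote_etag` is a hypothetical helper that only shows the header transformation being discussed.

```python
def quote_etag(value: str) -> str:
    """Return an ETag header value in quoted form.

    Hypothetical helper for illustration only; not part of Swift.
    Values that are already quoted are returned unchanged.
    """
    value = value.strip()
    prefix = ""
    # A weak validator keeps its W/ prefix; only the opaque tag is quoted.
    if value.startswith("W/"):
        prefix, value = "W/", value[2:]
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        return prefix + value
    return f'{prefix}"{value}"'


if __name__ == "__main__":
    # An MD5-style ETag as sent without double quotes.
    print(quote_etag("d41d8cd98f00b204e9800998ecf8427e"))
```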

Jul 11 2022, 1:31 PM · Wikimedia-Performance-recommendation, Traffic-Icebox, SRE-swift-storage, SRE, affects-Kiwix-and-openZIM
mark added a comment to T256217: Swift sends ETAG without double-quotes.

I'm not sure since when, but based on us having <14 days of ats-be storage, and on there still being ETag headers on cached responses, I am guessing this is a pretty recent regression.

I'm finding that upload.wikimedia.org responses have neither ETag nor Last-Modified. This is also observed in T295556, but given the above caching I'm skeptical that it is the same issue. Perhaps these headers are both kept in Swift and still used post-Swift upgrade for pre-existing objects?

Jul 11 2022, 1:28 PM · Wikimedia-Performance-recommendation, Traffic-Icebox, SRE-swift-storage, SRE, affects-Kiwix-and-openZIM

Feb 24 2022

mark added a parent task for T292322: Support large files in Shellbox: T302430: <Tech Initiative> Commons Copy-by-URL Image Uploads Slowdown (Shellbox).
Feb 24 2022, 4:37 PM · MW-1.38-notes (1.38.0-wmf.21; 2022-02-07), SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
mark added a subtask for T302430: <Tech Initiative> Commons Copy-by-URL Image Uploads Slowdown (Shellbox): T292322: Support large files in Shellbox.
Feb 24 2022, 4:37 PM · Foundational Technology Requests

Jul 26 2021

mark raised the priority of T263220: Limit concurrency of DPL queries from Medium to High.

Given that the underlying problem that this change might help with has already caused multiple full outages (all wikis affected) in the past year alone, and the extension is deployed on quite a few wikis, I'd like to ask for this to be looked into again in the near term. Raising priority to 'high'. Would this be in scope for PET's Clinic Duty? How can SRE help?

Jul 26 2021, 12:31 PM · SRE-Sprint-Week-Sustainability-March2023, serviceops-radar, Wikimedia-Slow-DB-Query, SecTeam-Processed, Security, Vuln-DoS, Sustainability (Incident Followup), Platform Team Workboards (Clinic Duty Team), MW-1.36-notes (1.36.0-wmf.18; 2020-11-17), Performance Issue, DynamicPageList (Wikimedia)

Feb 19 2021

mark added a comment to T274459: Eqiad: 2 VM request for GitLab.

Whoa, catching up on scrollback overnight. My question is: is this the first anyone in SRE has heard about any of this?

Feb 19 2021, 10:18 AM · GitLab (Initialization), Patch-For-Review, User-brennen, vm-requests, SRE

Jan 22 2021

mark added a comment to T272686: print a list of backed up directories in the MOTD of production servers.

It's purely an idea I've had for a long time, to make it immediately obvious to anyone logging in what is backed up, and what isn't. That should help to:

Jan 22 2021, 11:43 AM · Data-Persistence-Backup, SRE

Oct 8 2020

mark added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Hi all,

Oct 8 2020, 11:52 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Sep 4 2020

mark added a comment to T262042: Security Issue Access Request for LSobanski.

Approved.

Sep 4 2020, 1:44 PM · Security-Team, Security

Sep 1 2020

mark added a comment to T261760: Requesting access to Production for lsobanski.

Approved.

Sep 1 2020, 3:24 PM · SRE, SRE-Access-Requests
mark added a comment to T261626: Requesting access to Production for klausman.

Approved.

Sep 1 2020, 10:34 AM · SRE, SRE-Access-Requests

Aug 20 2020

mark updated subscribers of T260764: backup2001 RAID controller failure, unable to post 2020-08-19.

@wiki_willy @Papaul It seems we've had an ongoing pattern of crashes with this (rather important) backup host, which means we are not yet able to trust it. Until we are able to resolve this, we cannot decommission the older hosts (that this replaces) either. At the moment the system doesn't even boot. Are there any steps we can take soon to debug this issue? Anything we can help with? Thanks!

Aug 20 2020, 10:35 AM · SRE, ops-codfw

Jul 10 2020

mark added a comment to T256451: Security Issue Access Request for Kormat.

Approved.

Jul 10 2020, 9:17 AM · User-Kormat, Security-Team, Security

May 27 2020

mark added a project to T247028: Database 'INSERT' query rate doubled (module_deps regression?): Platform Team Workboards (Clinic Duty Team).
May 27 2020, 10:43 AM · MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Sustainability (Incident Followup), Performance Issue, MediaWiki-ResourceLoader, Performance-Team

Apr 7 2020

mark added a project to T157651: sql.php must not run LoadExtensionSchemaUpdates: DBA.
Apr 7 2020, 12:11 PM · Sustainability (Incident Followup), MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Wikidata, Growth-Team, StructuredDiscussions, Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Performance-Team, MediaWiki-Maintenance-system

Feb 21 2020

mark added a comment to T245520: 2*10G optics down on cr2-esams.

I am pretty sure there are a bunch of optics (of various kinds) in the "spare" switches, in the bottom of rack OE15. Unfortunately those switches are not powered up, and certainly not configured and remotely manageable - something we should probably fix on the next visit.

Feb 21 2020, 12:17 PM · ops-esams, netops, SRE

Feb 18 2020

mark added a comment to T245520: 2*10G optics down on cr2-esams.

There are multiple 10G LR optics on-site for sure. Longer distance ones, less so.

Feb 18 2020, 2:59 PM · ops-esams, netops, SRE

Feb 13 2020

mark added a comment to T245060: Pybal should reject a confctl configuration that indicates only one cp-text is pooled.

Personally I don't think Pybal should be rejecting that; it's a valid configuration from a technical standpoint, and there can be valid reasons to have it, at least temporarily. But we may decide that in our specific environment that should be avoided at all cost, so perhaps that logic should be implemented elsewhere - in the code that manages pooling state.
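
A minimal sketch of what such a guard could look like in the code that manages pooling state, rather than in PyBal itself. Everything here is an assumption for illustration: `check_pool_change`, the `MIN_POOLED` threshold, and the host names are hypothetical and do not reflect conftool's or PyBal's actual interfaces.

```python
from typing import Mapping

# Hypothetical site policy: never leave fewer than this many hosts pooled.
MIN_POOLED = 2


def check_pool_change(state: Mapping[str, bool], min_pooled: int = MIN_POOLED) -> None:
    """Raise ValueError if the proposed pooling state violates site policy.

    `state` maps host name -> whether the host would be pooled after the change.
    The load balancer would still accept such a state; the policy lives here.
    """
    pooled = sorted(host for host, is_pooled in state.items() if is_pooled)
    if len(pooled) < min_pooled:
        raise ValueError(
            f"only {len(pooled)} host(s) would remain pooled ({pooled}); "
            f"policy requires at least {min_pooled}"
        )


if __name__ == "__main__":
    proposed = {"cp-text-a": True, "cp-text-b": False, "cp-text-c": False}
    try:
        check_pool_change(proposed)
    except ValueError as err:
        print(f"refusing change: {err}")
```

Keeping the threshold in the tooling leaves room for an explicit override when a single pooled host is temporarily intended.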

Feb 13 2020, 11:49 AM · SRE-Sprint-Week-Sustainability-March2023, Traffic, Traffic-Icebox, Sustainability (Incident Followup), PyBal

Feb 12 2020

mark added a comment to T236437: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet.

@wiki_willy With Chris having been ill the past few days, what's a realistic new ETA for this?

Feb 12 2020, 4:47 PM · serviceops, SRE

Dec 10 2019

mark added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.
Dec 10 2019, 12:02 PM · SRE, Prod-Kubernetes, PyBal, Traffic, serviceops
mark added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.

I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balancers. Why couldn't our caching layer do that directly, and know about all the k8s proxies/nodes directly and do health checks for them?

Dec 10 2019, 11:43 AM · SRE, Prod-Kubernetes, PyBal, Traffic, serviceops

Nov 28 2019

mark moved T237041: wipe backup-array1 from Backlog to Blocked on the ops-esams board.
Nov 28 2019, 11:34 AM · ops-esams, SRE
mark moved T174637: Setup esams atlas anchor from Racking Tasks to Blocked on the ops-esams board.
Nov 28 2019, 11:34 AM · SRE, netops, ops-esams

Nov 27 2019

mark added a comment to T184066: rack/setup/install ps[12]-oe1[456]-esams.

qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17
qfx5100-spare2, psu 0 {#20157} to ps2-oe15-esams:16
qfx5100-spare1, psu 1 {#20159} to ps1-oe15-esams:2
qfx5100-spare2, psu 1 {#20158} to ps1-oe15-esams:3
asw2-oe16-esams:psu0 {#20162} to ps2-oe16-esams:26
asw2-oe16-esams:psu1 {#20164} to ps1-oe16-esams:26

All the above are done, but NOT

scs1-oe15-esams:psu1 {#20163} to ps2-oe15-esams:34
scs1-oe15-esams:psu2 {#20164} to ps1-oe15-esams:34

as there is no scs-oe15-esams, not sure what that is. Mark's comment T184066#5694430 covers scs-oe16-esams.

Nov 27 2019, 11:05 AM · SRE, ops-esams

Nov 26 2019

mark added a comment to T184066: rack/setup/install ps[12]-oe1[456]-esams.

scs1-oe16-esams:psu1 {#20163} to ps2-oe16-esams:34
scs1-oe16-esams:psu2 {#20164} to ps1-oe16-esams:34

Nov 26 2019, 5:54 PM · SRE, ops-esams
mark closed T238835: apply asset tags to cable managers as Resolved.
Nov 26 2019, 4:39 PM · SRE, ops-esams
mark moved T237009: Add missing labels for equipment and cables from Procurement to Blocked on the ops-esams board.
Nov 26 2019, 4:38 PM · DC-Ops, ops-esams, SRE
mark updated the task description for T237009: Add missing labels for equipment and cables.
Nov 26 2019, 4:12 PM · DC-Ops, ops-esams, SRE
mark added a comment to T237009: Add missing labels for equipment and cables.

cr3-esams now has its power cables labeled:

Nov 26 2019, 4:11 PM · DC-Ops, ops-esams, SRE
mark added a comment to T237009: Add missing labels for equipment and cables.

cr2-esams now has its power cables labeled:

Nov 26 2019, 3:58 PM · DC-Ops, ops-esams, SRE
mark closed T237006: Relabel cables with duplicate IDs as Resolved.

All duplicate ids have been fixed, labels replaced for one pair and updated in netbox.

Nov 26 2019, 3:00 PM · SRE, ops-esams
mark updated the task description for T237009: Add missing labels for equipment and cables.
Nov 26 2019, 2:36 PM · DC-Ops, ops-esams, SRE
mark added a comment to T237009: Add missing labels for equipment and cables.

I've filled out all red cells in the (original) bootstrap spreadsheet.

Nov 26 2019, 2:36 PM · DC-Ops, ops-esams, SRE
mark added a comment to T238835: apply asset tags to cable managers.

All 7 cable managers have been asset tagged and put into Netbox with the appropriate info and rack position.

Nov 26 2019, 2:26 PM · SRE, ops-esams
mark added a comment to T237009: Add missing labels for equipment and cables.

All SERVER power cords have been audited in this sheet: https://docs.google.com/spreadsheets/d/1RMb6lMCc94wUj6MgSm1yYdnAC3SUsZIRj8zHLtxRx4o/edit?usp=sharing

Nov 26 2019, 1:26 PM · DC-Ops, ops-esams, SRE
mark updated the task description for T237009: Add missing labels for equipment and cables.
Nov 26 2019, 1:25 PM · DC-Ops, ops-esams, SRE
mark closed T237014: Update spare QFX labels as Resolved.

Done.

Nov 26 2019, 10:15 AM · ops-esams, SRE

Nov 25 2019

mark updated the task description for T237030: Setup new MX204 in knams.
Nov 25 2019, 6:14 PM · netops, SRE, ops-esams
mark updated the task description for T237030: Setup new MX204 in knams.
Nov 25 2019, 5:43 PM · netops, ops-esams, SRE

Nov 4 2019

mark moved T234450: Special:Contributions requests with a high &limit= caused excessive database load from Done to Discussing on the Platform Team Workboards (Clinic Duty Team) board.

CPT: please take a new look, thanks :)

Nov 4 2019, 5:17 PM · User-notice-archive, MW-1.31-release-notes, MW-1.33-notes, MW-1.34-notes, Platform Engineering, Security, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Vuln-DoS, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error

Oct 24 2019

mark added a comment to T232887: The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent.

I'm a bit confused; as far as I know the old plan was always to have HA of Phabricator between eqiad and codfw, and the linked task T190572 also talks about that. So is that no longer the case, and if so, why is that? I believe there have been blockers & complications for that deployment, but are they documented anywhere? How does this task relate to those plans, and why do we feel failover within eqiad is (also) needed?

Oct 24 2019, 3:57 PM · SRE, hardware-requests, Release-Engineering-Team (Development services), serviceops, Phabricator

Oct 22 2019

mark added projects to T234450: Special:Contributions requests with a high &limit= caused excessive database load: Platform Engineering, Platform Team Workboards (Clinic Duty Team).

Could CPT take a look at this please? Thanks!

Oct 22 2019, 9:54 AM · User-notice-archive, MW-1.31-release-notes, MW-1.33-notes, MW-1.34-notes, Platform Engineering, Security, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), Vuln-DoS, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error

Sep 17 2019

mark added a comment to T231387: Updating DNS records (pr.wikimedia.org).

What's the status of this? Is this done and working?

Sep 17 2019, 12:25 PM · Mail, WMF-Communications, SRE

Sep 12 2019

mark added a project to T231387: Updating DNS records (pr.wikimedia.org): Mail.
Sep 12 2019, 12:55 PM · Mail, WMF-Communications, SRE
mark updated subscribers of T231387: Updating DNS records (pr.wikimedia.org).

@mark - Thank you very much for that thoughtful and helpful reply!

Talking it over, we would like to try the first option if you believe that will work.

So how do we go about getting this setup?

Anusha Alikhan
aalikhan@pr.wikimedia.org

Samantha Lien
slien@pr.wikimedia.org

Sep 12 2019, 12:54 PM · Mail, WMF-Communications, SRE

Sep 5 2019

mark changed the status of T231387: Updating DNS records (pr.wikimedia.org) from Stalled to Open.
Sep 5 2019, 2:32 PM · Mail, WMF-Communications, SRE
mark added a comment to T231387: Updating DNS records (pr.wikimedia.org).

Hi Anusha, Greg,

Sep 5 2019, 2:32 PM · Mail, WMF-Communications, SRE

Aug 9 2019

mark added a comment to T229755: csw2-esams's VCP link flapped.

EX4200 can also have any port converted to a VC port - it just won't be as fast, max 10 Gbps.

Aug 9 2019, 10:02 AM · SRE, netops

Aug 6 2019

mark added a comment to T229860: SRE Onboarding for Sukhbir Singh.

Approved for access.

Aug 6 2019, 11:13 AM · SRE-Access-Requests, Traffic, SRE

Jul 23 2019

mark raised the priority of T228720: stub for enwiki broken, attempt to load content for bad rev during sha1 retrieval from High to Unbreak Now!.

This means that right now stub dumps generation for (at least) enwiki, dewiki and several others is broken, and we have only a few days to fix this before the dumps need to be done at the end of the month. Setting UBN...

Jul 23 2019, 1:40 PM · Platform Team Initiatives (MCR), MW-1.34-notes (1.34.0-wmf.14; 2019-07-16), Dumps-Generation

Apr 16 2019

mark renamed T218570: DB planning: include a writeable (?) misc DB cluster in codfw for WMCS from DB planning: include a misc cluster in codfw to DB planning: include a writeable (?) misc DB cluster in codfw for WMCS.
Apr 16 2019, 10:43 AM · DBA, cloud-services-team (Kanban)

Apr 5 2019

mark updated the task description for T219805: Investigate Doctrine DBAL usage possibility.
Apr 5 2019, 11:13 AM · User-Addshore, Wikidata-Trailblazing-Exploration, Wikidata, TechCom, Patch-For-Review
mark added a comment to T219805: Investigate Doctrine DBAL usage possibility.

While I agree with Daniel and others that the use of the MediaWiki db connection/load balancing layer is an absolute minimum requirement, there are quite a few other potential problems that could affect the security/privacy, reliability or maintainability of our data and services, if Doctrine is to be used to access MediaWiki's existing databases in any way (it's definitely easier if done in separate, not connected database clusters). However, this ticket so far is very sparse on details, and we don't have the information we need to make an informed decision. I requested access to the linked document yesterday, but it hasn't been granted yet. Alternatively, could this perhaps be replicated here on Phabricator so everyone involved can build an informed opinion? Thanks. :)

Apr 5 2019, 11:13 AM · User-Addshore, Wikidata-Trailblazing-Exploration, Wikidata, TechCom, Patch-For-Review

Apr 1 2019

mark added a project to T190379: RFC: Re-establish the development policies: DBA.

There has been some concern from our DBAs that the archiving of the old policy will make it even harder for developers to find out what database-related requirements their code should fulfill, and what the processes would be to get any schema or query changes deployed (such as a link to the Schema_changes page). The old information on database-related requirements, while admittedly a bit outdated, was discussed as an RFC at the time: https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2015-09-16

Apr 1 2019, 1:08 PM · DBA, Performance-Team, TechCom-RFC (TechCom-RFC-Closed), TechCom

Mar 22 2019

Effie Mouzeli <effie@wikimedia.org> committed rMSCA5e1eced094fe: Add unit testing of scap main.py (authored by mark).
Add unit testing of scap main.py
Mar 22 2019, 11:33 AM

Mar 21 2019

Mill <mill@mail.com> committed rMSCA135f64c71c56: 3%5eaaaaaaaaaaaa (authored by mark).
3%5eaaaaaaaaaaaa
Mar 21 2019, 12:11 AM

Mar 6 2019

Effie Mouzeli <effie@wikimedia.org> committed rMSCA8d204fe0b7a9: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Mar 6 2019, 7:37 PM
Effie Mouzeli <effie@wikimedia.org> committed rMSCA2ab9d6f3e4d9: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Mar 6 2019, 6:27 PM
Effie Mouzeli <effie@wikimedia.org> committed rMSCAa7a532cb535f: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Mar 6 2019, 5:51 PM
Effie Mouzeli <effie@wikimedia.org> committed rMSCA705d3be59ec8: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Mar 6 2019, 4:04 PM

Feb 22 2019

mark committed rMSCAd624470dbe89: WiP: Add unit testing of scap main.py.
WiP: Add unit testing of scap main.py
Feb 22 2019, 5:27 PM
mark committed rMSCA3b376098e5fa: WiP: Add unit testing of scap main.py.
WiP: Add unit testing of scap main.py
Feb 22 2019, 5:27 PM

Feb 5 2019

mark removed a watcher for ops-codfw: mark.
Feb 5 2019, 2:37 PM

Jan 23 2019

mark added a comment to T211254: Free up 185.15.59.0/24.

It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.

In a world where there's ample address space (such as 10/8 in our context), yes. In today's world where IPv4 address space is scarce and we can likely not get any more, not so much.

I would personally have preferred that with the renumbering of WMCS they simply acquired new public IPv4 space of their own.

That's simply not realistic, they can't "acquire" IPv4 address space of their own. They're part of this organisation, this ASN, and need to use our PI/PA space where we have it available before we collectively can get more.

I understand the basic concerns here about exhaustion and how the process works. I think it would've been possible to find a way to ask for new or acquire new space though, even in the US. It's just a process and a cost at the end of the day.

Jan 23 2019, 3:16 PM · Infrastructure-Foundations, Patch-For-Review, Traffic, SRE, netops
mark added a comment to T211730: Replace accepted-prefix-limit with prefix-limit.

Yes, we should probably move over to prefix-limit to prevent (improving) filters from making accepted-prefix-limit ineffective.
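
To illustrate the concern: a limit that counts only prefixes accepted after import filtering never trips when the filters reject most of a leak, while a limit on received prefixes still does. The sketch below is a toy model, not router configuration; the function name and all numbers are made up.

```python
def limit_tripped(received: int, accepted: int, *, limit: int, count_accepted_only: bool) -> bool:
    """Toy model of a BGP prefix limit.

    count_accepted_only=True  models a limit on prefixes accepted by import policy.
    count_accepted_only=False models a limit on all prefixes received from the peer.
    """
    counted = accepted if count_accepted_only else received
    return counted > limit


if __name__ == "__main__":
    # Hypothetical leak: the peer sends a huge number of prefixes,
    # but the import filters accept only the 50 expected ones.
    received, accepted, limit = 800_000, 50, 100
    print("accepted-only limit trips:",
          limit_tripped(received, accepted, limit=limit, count_accepted_only=True))
    print("received limit trips:",
          limit_tripped(received, accepted, limit=limit, count_accepted_only=False))
```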

Jan 23 2019, 2:23 PM · netops, SRE
mark added a comment to T211254: Free up 185.15.59.0/24.

It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.

Jan 23 2019, 1:55 PM · Infrastructure-Foundations, Patch-For-Review, Traffic, SRE, netops
mark added a comment to T211728: Outbound BGP graceful shutdown.

Have a look at https://github.com/mwiget/bgp_graceful_shutdown for a JunOS op script (SLAX) that does this fully automatically for all peers with a single command.

Jan 23 2019, 1:38 PM · Infrastructure-Foundations, SRE, netops
mark closed T186021: reconfigure esams switch port for new bastion as Declined.

This was solved by fixing the original bastion, a while ago.

Jan 23 2019, 1:22 PM · ops-esams, netops, SRE
mark closed T186021: reconfigure esams switch port for new bastion, a subtask of T184936: install/designate other machine as esams bastion, as Declined.
Jan 23 2019, 1:22 PM · SRE, ops-esams
mark added a comment to T211254: Free up 185.15.59.0/24.

I really don't see the point of this. With the scarcity of IPv4 space we only need to get MORE flexible about how we use our IP space, and we will almost certainly not be able to maintain production vs others split between these address blocks in the future. Rather than spend time on renumbering I think it's much more valuable to spend that effort on better managing our ACLs and more automation.

Jan 23 2019, 1:12 PM · Infrastructure-Foundations, Patch-For-Review, Traffic, SRE, netops

Jan 11 2019

mark reopened T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring as "Open".
Jan 11 2019, 3:13 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, SRE, observability
mark raised the priority of T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring from Medium to High.

Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper per-rack power usage data is still entirely missing. This makes it currently very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.
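
As a rough sketch of the two numbers that are missing per rack, assuming per-phase power readings are available from the PDUs: the total across phases and a simple imbalance figure. The imbalance definition used here (largest deviation from the mean, divided by the mean) is one common convention and an assumption, as are the function name and the sample readings.

```python
from typing import Sequence, Tuple


def rack_power_summary(phase_watts: Sequence[float]) -> Tuple[float, float]:
    """Return (total watts, phase imbalance ratio) for one rack.

    Imbalance is the largest deviation of any phase from the mean,
    divided by the mean (0.0 means perfectly balanced).
    """
    if not phase_watts:
        raise ValueError("need at least one phase reading")
    total = float(sum(phase_watts))
    mean = total / len(phase_watts)
    if mean == 0:
        return total, 0.0
    imbalance = max(abs(w - mean) for w in phase_watts) / mean
    return total, imbalance


if __name__ == "__main__":
    # Hypothetical readings for the three phases of one rack, in watts.
    total, imbalance = rack_power_summary([1200.0, 900.0, 900.0])
    print(f"total={total:.0f} W, imbalance={imbalance:.0%}")
```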

Jan 11 2019, 3:13 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, SRE, observability

Dec 19 2018

mark added a comment to T212129: Move MainStash out of Redis to a simpler multi-dc aware solution.

I am getting the impression here that some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact, after this new discovery. Is that because goals are due to be posted now?

Dec 19 2018, 3:05 PM · MW-1.39-notes (1.39.0-wmf.16; 2022-06-13), MW-1.38-notes (1.38.0-wmf.20; 2022-01-31), Performance-Team, Sustainability (MediaWiki-MultiDC), MediaWiki-General, serviceops-radar, User-mobrovac, User-jijiki, SRE

Oct 12 2018

mark moved T199677: cp3033 unreacheable since 2018-07-15 11:47:31 from Backlog to Hardware Failure / Repair on the ops-esams board.
Oct 12 2018, 2:51 PM · ops-esams, SRE, Traffic

Sep 18 2018

mark reassigned T201470: Add contint-roots to releases{1,2}001 from mark to RobH.
Sep 18 2018, 11:35 AM · Patch-For-Review, Release-Engineering-Team (Watching / External), SRE-Access-Requests, SRE
mark updated subscribers of T201470: Add contint-roots to releases{1,2}001.

Although we didn't manage to discuss this in our SRE meeting yesterday I discussed it with relevant people afterwards.

Sep 18 2018, 11:35 AM · Patch-For-Review, Release-Engineering-Team (Watching / External), SRE-Access-Requests, SRE

Sep 11 2018

mark added a comment to T204083: wikibase_shared/<current_train_version>-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic.

T97368 appears to be about the same issue.

Sep 11 2018, 9:17 PM · [DEPRECATED] wdwb-tech, Performance-Team, SRE, wikiba.se website, Wikidata
mark added projects to T203039: Storage of data for recommendation API: DBA, SRE.
Sep 11 2018, 4:39 PM · Analytics, SRE, DBA, Services (designing), Research
mark added projects to T204026: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`": DBA, SRE.
Sep 11 2018, 1:31 PM · MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), Wikimedia-Slow-DB-Query, Language-Team (Language-2021-October-December), MW-1.37-notes (1.37.0-wmf.23; 2021-09-13), MediaWiki-extensions-CentralNotice, Wikimedia-Fundraising, Performance-Team (Radar), Datacenter-Switchover, Wikimedia-production-error, SRE, MediaWiki-extensions-Translate
mark added a comment to T203674: Debian package or files managed my puppet for pt-kill-wmf.

Indeed, let's go with a "proper" Debian package, imho the cleanest way to go and conforming to how we do things.

Sep 11 2018, 10:18 AM · User-Banyek, Puppet, SRE

Sep 3 2018

mark added a comment to T203182: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese.

Yes, this can be merged once Nuria approves.

Sep 3 2018, 10:35 AM · Patch-For-Review, SRE, SRE-Access-Requests

Aug 14 2018

mark updated subscribers of T201856: Subscribe user mepps to security@wikimedia.org.

@Dzahn please get her added to this list. Thanks!

Aug 14 2018, 5:36 PM · SRE, SRE-Access-Requests

Aug 13 2018

mark updated the task description for T201694: Move servers off asw2-a-eqiad.
Aug 13 2018, 9:08 AM · Patch-For-Review, SRE, netops

Aug 10 2018

mark added a comment to T200297: Review Jade data storage and architecture proposal [RFC].

I talked to @mark today. Here's what I understood from the conversation:

  1. All of the following points assume that the TechCom discussion happens and there's a decision that the local-wiki JADE namespace is the only reasonable implementation strategy
  2. Large wikis (enwiki, wikidatawiki, and commonswiki) are where the concerns exist. All other, smaller wikis are less of a concern.
  3. The revision table is the only table that is a serious concern for large wikis. The page table is less of a concern.
  4. Our estimated growth of 0.5M new revisions per large wiki per year is acceptable growth.
  5. In order to account for fluctuations, a ceiling of 1M new revisions per large wiki per year is acceptable for JADE judgments.
Aug 10 2018, 12:37 PM · TechCom-RFC (TechCom-RFC-Closed), MW-1.33-notes (1.33.0-wmf.14; 2019-01-22), Patch-For-Review, Machine-Learning-Team (Active Tasks), DBA, SRE, Jade

Jul 30 2018

mark added a comment to T200297: Review Jade data storage and architecture proposal [RFC].

I am a bit confused by this RFC/proposal as it stands now, as I feel it doesn't really reflect the discussions we've been having.

Jul 30 2018, 2:27 PM · TechCom-RFC (TechCom-RFC-Closed), MW-1.33-notes (1.33.0-wmf.14; 2019-01-22), Patch-For-Review, Machine-Learning-Team (Active Tasks), DBA, SRE, Jade

Jul 25 2018

mark added a comment to T195923: rack/setup/install cp1075-cp1090.

@ayounsi I was not able to add the ports in row A to the public vlan. Can you check the following and add them to the public vlan? Also, adding the servers to the vlan in the other rows did not automatically enable the ports. Can you also check that, please?

Jul 25 2018, 12:43 PM · Patch-For-Review, ops-eqiad, Traffic, SRE
mark added a comment to T168539: Unhandled pybal error: OpenSSL.SSL.Error - ssl handshake failure.

@ema: Has this been seen again? Does this need any work in Pybal?

Jul 25 2018, 11:07 AM · Traffic, PyBal, SRE
mark moved T113597: pybal-related issue on host start can break service IPs... from Backlog to Blocked on the PyBal board.
Jul 25 2018, 10:58 AM · Traffic-Icebox, SRE, PyBal
mark moved T114104: pybal doesn't fully manage LVS table leaving stale services (on IP change) from Backlog to Blocked on the PyBal board.
Jul 25 2018, 10:58 AM · Traffic-Icebox, SRE, PyBal
mark moved T86650: Add support for setting weight=0 when depooling from Backlog to Blocked on the PyBal board.
Jul 25 2018, 10:56 AM · Traffic-Icebox, SRE, PyBal
mark moved T114979: Run IPVS in a separate network namespace from Backlog to Blocked on the PyBal board.
Jul 25 2018, 10:56 AM · Traffic-Icebox, SRE, PyBal
mark moved T172124: PyBal Feature: progressive depooling strategy for monitored failures from Backlog to Blocked on the PyBal board.
Jul 25 2018, 10:56 AM · Traffic-Icebox, PyBal, SRE
mark created T200319: Migrate Pybal to Python 3.
Jul 25 2018, 10:55 AM · User-Ladsgroup, Python3-Porting, PyBal
mark added a comment to T200277: OSPF metrics.

The eqdfw-knams link needs to have a lower metric than the current primary (codfw-eqiad + eqiad-esams) links, so that traffic from codfw to esams prefers that link.
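
The reasoning can be sketched as a cost comparison: OSPF prefers the path with the lowest sum of link metrics, so the direct link wins only if its metric is below the combined metric of the two-hop path. This is a simplification that ignores any other links on either path, and the metric values below are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def preferred_path(paths: Dict[str, Sequence[int]]) -> Tuple[str, Dict[str, int]]:
    """Return the name of the lowest-cost path and the cost of every path.

    Each path is given as the list of OSPF metrics of its links; the path
    cost is their sum. On a tie the first path listed is returned here,
    whereas a real router could load-share across equal-cost paths.
    """
    costs = {name: sum(metrics) for name, metrics in paths.items()}
    best = min(costs, key=costs.get)
    return best, costs


if __name__ == "__main__":
    # Hypothetical metrics, for illustration only.
    best, costs = preferred_path({
        "via codfw-eqiad + eqiad-esams": [340, 810],
        "via eqdfw-knams": [1000],
    })
    print(costs)
    print("preferred:", best)
```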

Jul 25 2018, 8:25 AM · Infrastructure-Foundations, SRE, netops