
MatthewVernon (Matthew Vernon)
SRE (Data Persistence)

User Details

User Since
Aug 2 2021, 1:52 PM (86 w, 5 d)
Availability
Available
IRC Nick
Emperor
LDAP User
MVernon
MediaWiki User
MVernon (WMF) [ Global Accounts ]

Recent Activity

Fri, Mar 31

MatthewVernon added a comment to T332883: Add-in Card 2 ROMB Battery LOW.

@Jclark-ctr ms-be1042 shut down ready for you.

Fri, Mar 31, 11:43 AM · SRE, ops-eqiad, DC-Ops, SRE-swift-storage, Analytics-Radar
MatthewVernon added a comment to T332883: Add-in Card 2 ROMB Battery LOW.

@MatthewVernon 1300 utc will be on site to change battery

Fri, Mar 31, 8:57 AM · SRE, ops-eqiad, DC-Ops, SRE-swift-storage, Analytics-Radar
MatthewVernon added a comment to T332883: Add-in Card 2 ROMB Battery LOW.

@Jclark-ctr I can shut down ms-be1042 for you (or you can DIY, there's no special procedure for this host). Can I confirm you want it shut down at 14:00 UTC today, please?

Fri, Mar 31, 8:13 AM · SRE, ops-eqiad, DC-Ops, SRE-swift-storage, Analytics-Radar

Wed, Mar 29

MatthewVernon added a comment to T333042: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster).

I think in these cases, removing the incorrect thumbnail will allow it to be recreated on next GET.

Wed, Mar 29, 9:12 AM · SRE, Traffic, Commons
MatthewVernon added a comment to T332983: Two failed disks in ms-be2067.

Hi folks. I am happy for you to do the firmware update first if you think that's the best approach.
Please do so at your earliest convenience - a note here when you're doing so would be appreciated. I'm happy for that to involve rebooting the server.

Wed, Mar 29, 8:13 AM · SRE, SRE-swift-storage, ops-codfw, DC-Ops

Tue, Mar 28

MatthewVernon updated the task description for T331882: eqiad row C switches upgrade.
Tue, Mar 28, 4:36 PM · Patch-For-Review, serviceops-radar, Discovery-Search (Current work), SRE, DBA, cloud-services-team, Traffic, Infrastructure-Foundations, Machine-Learning-Team, Data-Engineering, serviceops-collab, Platform Engineering, SRE Observability
MatthewVernon updated the task description for T331882: eqiad row C switches upgrade.
Tue, Mar 28, 4:36 PM · Patch-For-Review, serviceops-radar, Discovery-Search (Current work), SRE, DBA, cloud-services-team, Traffic, Infrastructure-Foundations, Machine-Learning-Team, Data-Engineering, serviceops-collab, Platform Engineering, SRE Observability
MatthewVernon updated the task description for T333377: eqiad row D switches upgrade.
Tue, Mar 28, 4:13 PM · SRE, cloud-services-team, Data-Persistence, Data-Engineering, Traffic, Machine-Learning-Team, serviceops-collab, Discovery-Search, Infrastructure-Foundations, Platform Engineering, SRE Observability
MatthewVernon updated the task description for T330165: eqiad row B switches upgrade.
Tue, Mar 28, 1:55 PM · Patch-For-Review, Data Pipelines, Data-Engineering-Planning, DBA, Discovery-Search (Current work), SRE, serviceops, cloud-services-team, Machine-Learning-Team, Platform Engineering, SRE Observability, Infrastructure-Foundations, serviceops-collab, Traffic
MatthewVernon added a comment to T332983: Two failed disks in ms-be2067.

The firmware update cookbook does offer a firmware update; I was going to apply it once the disks were swapped (as rebooting the system with drives in a funny state doesn't always go smoothly).

ms-be2067.codfw.wmnet (Gen 14): starting
ms-be2067.codfw.wmnet (IDRAC): update
ms-be2067.codfw.wmnet (IDRAC): current version: 5.0.20.0
poweredge-r740xd2: picking DellDriverCategory.IDRAC update file
We have found multiple entries please pick from the list below:
0: /srv/firmware/poweredge-r740xd2/IDRAC/iDRAC-with-Lifecycle-Controller_Firmware_C8NT1_WN64_6.10.30.00_A00.EXE
1: Download new file
==> Please select the entry you want
> 0
User input is: "0"
ms-be2067.codfw.wmnet (IDRAC): target_version: 6.10.30.0, current_version: 5.0.20.0
==> ms-be2067.codfw.wmnet IDRAC: About to upload /srv/firmware/poweredge-r740xd2/IDRAC/iDRAC-with-Lifecycle-Controller_Firmware_C8NT1_WN64_6.10.30.00_A00.EXE, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> abort
User input is: "abort"
Tue, Mar 28, 8:36 AM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
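The cookbook output above reports `target_version: 6.10.30.0, current_version: 5.0.20.0` before asking for confirmation. As a minimal sketch of how such dotted versions can be compared safely (this is not the cookbook's own code; the function names `parse_version` and `needs_update` are hypothetical), comparing integer tuples avoids the pitfall of string comparison, where "6.10" would sort before "6.9":

```python
def parse_version(text):
    """Turn a dotted version such as '6.10.30.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_update(current, target):
    """Return True when the target version is strictly newer than the current one."""
    return parse_version(target) > parse_version(current)


if __name__ == "__main__":
    # Versions copied from the cookbook output in the comment above.
    print(needs_update("5.0.20.0", "6.10.30.0"))
```

Parsing to integers also means the `6.10.30.00` spelling in the firmware file name and the `6.10.30.0` spelling in the cookbook message compare as equal.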

Fri, Mar 24

MatthewVernon triaged T332983: Two failed disks in ms-be2067 as High priority.
Fri, Mar 24, 12:04 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
MatthewVernon created T332983: Two failed disks in ms-be2067.
Fri, Mar 24, 12:04 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops

Wed, Mar 22

MatthewVernon added a comment to T269108: Create a read-only swift identity for backup taking.

I wonder (but this is not a settled position) whether using an account ACL is the more elegant solution, as we do that once and it'll work for all deleted containers?
The upstream docs are a bit ... spartan, swiftstack provide slightly more information.

Wed, Mar 22, 10:56 AM · Data-Persistence-Backup, SRE, SRE-swift-storage
MatthewVernon added a comment to T269108: Create a read-only swift identity for backup taking.

Yeah, that's a good question - I think there are about 21675 deleted containers.
I think there's no automation for container management (is that right @fgiunchedi ?) so the options are presumably to write a loop that adds the mw:backup account to the read ACL for each deleted container, or maybe to make a read-only account ACL so the mw:backup user has read-only access to everything the mw:media account owns?

Wed, Mar 22, 10:44 AM · Data-Persistence-Backup, SRE, SRE-swift-storage
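The first option mentioned above, a loop that adds the mw:backup account to the read ACL of each deleted container, could be prepared roughly as follows. This is only a sketch under stated assumptions: the helper names `add_reader` and `post_command` are hypothetical, the starting ACL is the one `swift stat` reports for wikipedia-mediawiki-local-deleted elsewhere in this feed, and the code only builds `swift post --read-acl` command strings for review rather than running anything.

```python
import shlex


def add_reader(acl, account):
    """Return the ACL string with the account appended, unless already present."""
    entries = [entry.strip() for entry in acl.split(",") if entry.strip()]
    if account not in entries:
        entries.append(account)
    return ",".join(entries)


def post_command(container, acl):
    """Build (but do not run) a swift CLI command that sets the container read ACL."""
    return "swift post --read-acl {} {}".format(shlex.quote(acl), shlex.quote(container))


if __name__ == "__main__":
    # Hypothetical input: container name -> current read ACL.
    containers = {"wikipedia-mediawiki-local-deleted": "mw:thumbor-private,mw:media"}
    for name, acl in containers.items():
        print(post_command(name, add_reader(acl, "mw:backup")))
```

Because `add_reader` leaves an ACL unchanged when the account is already listed, re-running the loop over containers that were already updated would be harmless.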
MatthewVernon added a comment to T269108: Create a read-only swift identity for backup taking.

I think the issue is that the deleted container has different permissions:

root@ms-fe1009:/home/mvernon# swift stat wikipedia-mediawiki-local-deleted | grep 'Read ACL'
        Read ACL: mw:thumbor-private,mw:media
root@ms-fe1009:/home/mvernon# swift stat wikipedia-commons-local-public.57 | grep 'Read ACL'
        Read ACL: mw:thumbor,mw:media,.r:*

So you can see the deleted container is only available to the mw:thumbor-private and mw:media accounts, whereas the public container is readable by any swift account.

Wed, Mar 22, 10:26 AM · Data-Persistence-Backup, SRE, SRE-swift-storage
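The comparison in the comment above can be automated by pulling the `Read ACL` line out of `swift stat` output and checking who it admits. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a `.r:*` entry (a referrer ACL matching any referrer) makes the objects readable without a listed account, which is how the public container behaves here.

```python
def read_acl_entries(stat_output):
    """Extract the comma-separated Read ACL entries from `swift stat` output."""
    for line in stat_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Read ACL":
            return [entry.strip() for entry in value.split(",") if entry.strip()]
    return []


def can_read(entries, account):
    """True if the account is listed, or the ACL admits any referrer (.r:*)."""
    return account in entries or ".r:*" in entries


if __name__ == "__main__":
    # ACL lines copied from the two swift stat calls in the comment above.
    deleted = read_acl_entries("        Read ACL: mw:thumbor-private,mw:media")
    public = read_acl_entries("        Read ACL: mw:thumbor,mw:media,.r:*")
    print(can_read(deleted, "mw:backup"), can_read(public, "mw:backup"))
```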

Tue, Mar 21

MatthewVernon assigned T256972: Refactor mariadb puppet code to Kormat.
Tue, Mar 21, 10:04 AM · DBA, User-jbond, User-Kormat
MatthewVernon updated Other Assignee for T256972: Refactor mariadb puppet code, added: Kormat.
Tue, Mar 21, 10:04 AM · DBA, User-jbond, User-Kormat
MatthewVernon updated subscribers of T256972: Refactor mariadb puppet code.

@Aklapper Kormat has been on leave for some time now; could you hold off unassigning their tasks, please, or at least co-ordinate with our team leader @KOfori first?

Tue, Mar 21, 10:03 AM · DBA, User-jbond, User-Kormat
MatthewVernon added a comment to T317358: 2022-08-24 swift incident (tracking).

"Write up an incident report", perhaps?

Tue, Mar 21, 9:58 AM · SRE-Sprint-Week-Sustainability-March2023, Data-Persistence, SRE-swift-storage, Sustainability (Incident Followup)

Mon, Mar 20

MatthewVernon added a comment to T331067: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP.

I think the right thing would be to open a new ticket; but I note it's SRE Sprint Week, so I'm not sure whether clinic duty tasks will get much attention this week unless flagged as urgent, I'm afraid.

Mon, Mar 20, 8:18 PM · SRE, SRE-Access-Requests
MatthewVernon closed T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw as Resolved.
Mon, Mar 20, 2:22 PM · Wikimedia-production-error, SRE-swift-storage, Commons
MatthewVernon added a comment to T156136: Increase swift replication factor for accounts.

I'm not sure how much this will actually help from the swift side (as opposed to frontend capacity, memcached, etc).
The account dbs are tiny, though, so it seems like a cheap thing to try...

Mon, Mar 20, 11:45 AM · SRE-Sprint-Week-Sustainability-March2023, Data-Persistence, Sustainability (Incident Followup), SRE-swift-storage

Fri, Mar 17

MatthewVernon added a comment to T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.

I think this issue is resolved now? Swift error rates have remained low since we restarted swift and thumbor.

Fri, Mar 17, 4:34 PM · Wikimedia-production-error, SRE-swift-storage, Commons
MatthewVernon closed T331820: Upstream caches: 404 as Resolved.

I'm closing this task now, as I think thumbnails are now being correctly generated.

Fri, Mar 17, 4:33 PM · SRE, Thumbor, SRE-swift-storage, Commons
MatthewVernon added a project to T332210: Thumbor 404s on an auth failure to Swift: SRE-swift-storage.
Fri, Mar 17, 4:30 PM · SRE-swift-storage, Platform Team Workboards (Platform Engineering Reliability), Thumbor
MatthewVernon updated the task description for T330165: eqiad row B switches upgrade.
Fri, Mar 17, 12:08 PM · Patch-For-Review, Data Pipelines, Data-Engineering-Planning, DBA, Discovery-Search (Current work), SRE, serviceops, cloud-services-team, Machine-Learning-Team, Platform Engineering, SRE Observability, Infrastructure-Foundations, serviceops-collab, Traffic
MatthewVernon added a comment to T331030: Two failed disks in ms-be2067.

Thanks!

Fri, Mar 17, 11:03 AM · SRE, DC-Ops, SRE-swift-storage, ops-codfw

Thu, Mar 16

MatthewVernon added a comment to T331030: Two failed disks in ms-be2067.

@Papaul that's the one - can you clear the Foreign state from that disk, please? I can't figure out how to do it, and I think without that config being cleared it's not available to the OS.

Thu, Mar 16, 4:45 PM · SRE, DC-Ops, SRE-swift-storage, ops-codfw
MatthewVernon closed T331860: Urgent: Two failed disks in ms-be2040 as Resolved.

@Papaul thanks; the other drive has behaved itself since the reboot, so I think we're OK to leave it in place for now.

Thu, Mar 16, 2:39 PM · ops-codfw, SRE-swift-storage, SRE, DC-Ops

Wed, Mar 15

MatthewVernon added a comment to T330693: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing.

This is a k8s application running on the WMF OpenStack, yes?

Wed, Mar 15, 3:38 PM · Event-Platform Value Stream (Sprint 11), Data-Engineering-Planning, SRE-swift-storage

Tue, Mar 14

MatthewVernon added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

@Cmjohnson these are just JBOD I think - at least that's how ms-fe1012 appears to me (and I think that's what I expect for a swift frontend) - we do software-raid on these systems.

Tue, Mar 14, 10:49 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon added a comment to T331820: Upstream caches: 404.

Some logs (NDA, they have IPs in) for a particular request above - P45869.

Tue, Mar 14, 10:43 PM · SRE, Thumbor, SRE-swift-storage, Commons
MatthewVernon added a comment to T331820: Upstream caches: 404.

I've had this intermittently today as well — https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Octicons-gift.svg/12px-Octicons-gift.svg.png being the most recent example

Tue, Mar 14, 10:09 PM · SRE, Thumbor, SRE-swift-storage, Commons
MatthewVernon added a comment to T331030: Two failed disks in ms-be2067.

...this may be a different disk that's failed, so maybe it just got unseated?

Tue, Mar 14, 5:09 PM · SRE, DC-Ops, SRE-swift-storage, ops-codfw
MatthewVernon added a comment to T317358: 2022-08-24 swift incident (tracking).

[for reference, I was on leave on 2022-08-24]

Tue, Mar 14, 5:06 PM · SRE-Sprint-Week-Sustainability-March2023, Data-Persistence, SRE-swift-storage, Sustainability (Incident Followup)
MatthewVernon reopened T331030: Two failed disks in ms-be2067 as "Open".

Hi @Papaul, sorry, but there's still a missing disk in this system - Virtual Drive 21 is absent, which I think is Slot Number: 19.
Could you have another look, please?

Tue, Mar 14, 5:05 PM · SRE, DC-Ops, SRE-swift-storage, ops-codfw
MatthewVernon added a comment to T331860: Urgent: Two failed disks in ms-be2040.

@Papaul that's only 13 disks, not 14?
The recent activity panel in the iDRAC shows:

2023-03-12T15:08:42-0500	Virtual Disk 8 on Integrated RAID Controller 1 was deleted.
2023-03-12T15:08:37-0500	The Patrol Read operation was stopped and did not complete for Integrated RAID Controller 1.
2023-03-12T15:08:37-0500	Controller cache is preserved for missing or offline Virtual Disk 8 on Integrated RAID Controller 1.
2023-03-12T15:08:37-0500	Disk 6 in Backplane 1 of Integrated RAID Controller 1 is removed.
2023-03-12T15:08:37-0500	Disk 6 in Backplane 1 of Integrated RAID Controller 1 was reset.
2023-03-12T15:08:37-0500	Disk 6 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly. 
2023-03-12T15:08:26-0500	Disk 6 in Backplane 1 of Integrated RAID Controller 1 was reset.
Tue, Mar 14, 10:04 AM · ops-codfw, SRE-swift-storage, SRE, DC-Ops
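The iDRAC panel above appears to list events newest first, which makes the failure sequence harder to follow. A small sketch that re-orders such lines chronologically is given below; the function names are hypothetical, and the newest-first ordering of the input is an assumption used only to break ties between events that share a timestamp.

```python
from datetime import datetime


def parse_event(line):
    """Split one panel line into (timezone-aware timestamp, message)."""
    stamp, message = line.split(None, 1)
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z"), message.strip()


def timeline(lines):
    """Return events oldest first.

    The input is reversed before a stable sort, so events sharing a timestamp
    come out in the opposite order to the panel (assumed to be newest first).
    """
    events = [parse_event(line) for line in lines if line.strip()]
    return sorted(reversed(events), key=lambda event: event[0])


if __name__ == "__main__":
    # A subset of the lines quoted in the comment above.
    sample = [
        "2023-03-12T15:08:42-0500\tVirtual Disk 8 on Integrated RAID Controller 1 was deleted.",
        "2023-03-12T15:08:37-0500\tDisk 6 in Backplane 1 of Integrated RAID Controller 1 is removed.",
        "2023-03-12T15:08:37-0500\tDisk 6 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly. ",
        "2023-03-12T15:08:26-0500\tDisk 6 in Backplane 1 of Integrated RAID Controller 1 was reset.",
    ]
    for when, message in timeline(sample):
        print(when.isoformat(), message)
```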

Mon, Mar 13

MatthewVernon added a comment to T329305: Testing Out Hard Drive on Swift Server.

No complaints from me, thanks, drive is now 25% loaded and behaving fine.

Mon, Mar 13, 5:33 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon added a comment to T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.

I am a bit concerned about the small rise in 50x codes reported on the ATS - backend dashboard since about 13:00:00 on 2023-03-09, but that doesn't match your time at all, and I can't find any relevant change that happened then; it is seen in both codfw and eqiad, too. Grepping proxy-access.log suggests that nearly all of the (at least today's) 50x codes from swift relate to thumbnails, which makes me wonder if thumbor is unhappy again.

Mon, Mar 13, 2:26 PM · Wikimedia-production-error, SRE-swift-storage, Commons
MatthewVernon triaged T331860: Urgent: Two failed disks in ms-be2040 as High priority.
Mon, Mar 13, 10:38 AM · ops-codfw, SRE-swift-storage, SRE, DC-Ops
MatthewVernon created T331860: Urgent: Two failed disks in ms-be2040.
Mon, Mar 13, 10:38 AM · ops-codfw, SRE-swift-storage, SRE, DC-Ops

Fri, Mar 10

MatthewVernon added a comment to T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.

I'm struggling to find any corresponding uptick in error logs on swift frontends, which is making me wonder if this isn't coming from elsewhere in the stack. But it's nearly COB on Friday and I'm running out of good ideas, sorry.

Fri, Mar 10, 4:49 PM · Wikimedia-production-error, SRE-swift-storage, Commons
MatthewVernon added a comment to T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.

Similarly, looking at logstash searches for e.g. host: thumbor* AND message:502 doesn't show any interesting rate changes.

Fri, Mar 10, 4:37 PM · Wikimedia-production-error, SRE-swift-storage, Commons
sbassett awarded T331554: New production ssh key for sbassett a Like token.
Fri, Mar 10, 3:25 PM · SecTeam-Processed, Security, SRE-Access-Requests, SRE
MatthewVernon added a comment to T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.

I don't think much has changed regarding prod swift since 6th March (when we added a couple more frontends - T331178 ); the reimaging work was apropos some new backends (T326352), but those nodes aren't in service (relevantly, aren't in the swift rings at all).

Fri, Mar 10, 2:46 PM · Wikimedia-production-error, SRE-swift-storage, Commons
MatthewVernon closed T331554: New production ssh key for sbassett as Resolved.

@sbassett done.

Fri, Mar 10, 9:58 AM · SecTeam-Processed, Security, SRE-Access-Requests, SRE
MatthewVernon moved T331647: Grant Hal deployment rights from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.
Fri, Mar 10, 9:52 AM · SRE, SRE-Access-Requests
MatthewVernon added a comment to T331647: Grant Hal deployment rights.

Happy to hold off until we're sure what the correct group is :)

Fri, Mar 10, 9:52 AM · SRE, SRE-Access-Requests
MatthewVernon moved T331647: Grant Hal deployment rights from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Mar 10, 9:35 AM · SRE, SRE-Access-Requests
MatthewVernon updated subscribers of T331647: Grant Hal deployment rights.

@thcipriani you're the approver for the deployment group, are you happy for me to add @Htriedman to it, please?

Fri, Mar 10, 9:34 AM · SRE, SRE-Access-Requests

Thu, Mar 9

MatthewVernon added a comment to T331633: Not receiving posts or moderation messages.

Setting to medium priority, because this is probably now just a case of waiting for the queue to drain.

Thu, Mar 9, 4:49 PM · SRE, Wikimedia-Mailing-lists
MatthewVernon lowered the priority of T331633: Not receiving posts or moderation messages from Unbreak Now! to Medium.
Thu, Mar 9, 4:49 PM · SRE, Wikimedia-Mailing-lists
MatthewVernon added a project to T329305: Testing Out Hard Drive on Swift Server: SRE-swift-storage.
Thu, Mar 9, 4:06 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon added a comment to T329305: Testing Out Hard Drive on Swift Server.

The swapped-in drive seems OK initially, I'll get swift to start using it shortly.

Thu, Mar 9, 3:28 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon added a comment to T329305: Testing Out Hard Drive on Swift Server.

[after a reboot the drive in slot 2 was in a "Foreign" state; clearing that made it possible to reintroduce it with sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0 and the filesystem recovered OK.]

Thu, Mar 9, 3:13 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon added a comment to T329305: Testing Out Hard Drive on Swift Server.

Can you check the drives in slots 23 and 2 are seated properly, please? The kernel still can't see them.

Thu, Mar 9, 2:44 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon added a comment to T329305: Testing Out Hard Drive on Swift Server.

Target Id 4 also missing

Thu, Mar 9, 2:33 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon added a comment to T329305: Testing Out Hard Drive on Swift Server.

Looking at these drives -

sdz is bus info: scsi@0:2.25.0
Target Id: 25 is Enclosure Device ID: 32 Slot Number: 23
Thu, Mar 9, 2:30 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
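The mapping in the comment above goes from a block device's bus info to a controller Target Id and from there to a physical slot. A rough sketch of the first step follows. It assumes the bus info string has the form `scsi@host:channel.target.lun`, as the sdz example suggests; the function names are hypothetical, and the lookup table contains only the single Target Id to enclosure/slot pair stated in the comment.

```python
import re

# Assumed layout: scsi@<host>:<channel>.<target>.<lun>
BUS_INFO = re.compile(r"scsi@(\d+):(\d+)\.(\d+)\.(\d+)")

# Target Id -> (Enclosure Device ID, Slot Number); only the pair quoted above.
SLOTS = {25: (32, 23)}


def scsi_target(bus_info):
    """Return the target number embedded in a bus info string."""
    match = BUS_INFO.search(bus_info)
    if match is None:
        raise ValueError("unrecognised bus info: {!r}".format(bus_info))
    host, channel, target, lun = (int(group) for group in match.groups())
    return target


def locate(bus_info):
    """Return (enclosure, slot) for a bus info string, or None if unknown."""
    return SLOTS.get(scsi_target(bus_info))


if __name__ == "__main__":
    print(locate("bus info: scsi@0:2.25.0"))
```

In practice the table would need to be filled from the controller's own physical drive listing rather than typed in by hand.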
MatthewVernon added a comment to T329305: Testing Out Hard Drive on Swift Server.

Something has gone a bit awry: the kernel reports problems with two other drives instead.

Mar  9 14:13:57 ms-be1066 kernel: [11683056.185701] sd 0:2:4:0: [sdf] tag#699 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=4s
Mar  9 14:14:00 ms-be1066 kernel: [11683059.173114] sd 0:2:25:0: [sdz] tag#897 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Thu, Mar 9, 2:19 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
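Kernel lines like the two above carry the SCSI address, the device name and the failure code in one place, so they can be reduced to a short summary when several drives misbehave at once. The following is a minimal sketch that matches only the line shape shown in this comment; the function name `parse_failure` and the returned field names are hypothetical.

```python
import re

FAILED = re.compile(
    r"sd (\d+):(\d+):(\d+):(\d+): \[(\w+)\].*?"
    r"FAILED Result: hostbyte=(\w+) driverbyte=(\w+)"
)


def parse_failure(line):
    """Return device, SCSI address and result codes, or None if the line does not match."""
    match = FAILED.search(line)
    if match is None:
        return None
    host, channel, target, lun, device, hostbyte, driverbyte = match.groups()
    return {
        "device": device,
        "address": (int(host), int(channel), int(target), int(lun)),
        "hostbyte": hostbyte,
        "driverbyte": driverbyte,
    }


if __name__ == "__main__":
    # Lines copied from the comment above.
    lines = [
        "Mar  9 14:13:57 ms-be1066 kernel: [11683056.185701] sd 0:2:4:0: [sdf] tag#699 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=4s",
        "Mar  9 14:14:00 ms-be1066 kernel: [11683059.173114] sd 0:2:25:0: [sdz] tag#897 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s",
    ]
    for line in lines:
        print(parse_failure(line))
```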
MatthewVernon added a comment to T329305: Testing Out Hard Drive on Swift Server.

Yes, please. I've unmounted a drive in ms-be1066 and turned on the locator light:
sudo megacli -PDLocate -PhysDrv [32:15] -a0

Thu, Mar 9, 2:08 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon added a project to T297314: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch): serviceops.
Thu, Mar 9, 12:08 PM · serviceops, Abstract Wikipedia team (Phase λ – Launch), Service-deployment-requests, Services, SRE
MatthewVernon closed T331500: Requesting access to analytics-privatedata-users for David Martin as Resolved.

@DMartin-WMF all done.

Thu, Mar 9, 12:06 PM · SRE, SRE-Access-Requests
MatthewVernon moved T331500: Requesting access to analytics-privatedata-users for David Martin from Awaiting User Input to Patch in Review on the SRE-Access-Requests board.
Thu, Mar 9, 10:15 AM · SRE, SRE-Access-Requests
MatthewVernon updated the task description for T331500: Requesting access to analytics-privatedata-users for David Martin.
Thu, Mar 9, 10:15 AM · SRE, SRE-Access-Requests
MatthewVernon moved T331554: New production ssh key for sbassett from Manager/NDA Approval/Confirmation to Patch in Review on the SRE-Access-Requests board.
Thu, Mar 9, 9:36 AM · SecTeam-Processed, Security, SRE-Access-Requests, SRE
MatthewVernon added a comment to T331554: New production ssh key for sbassett.

@sbassett I've opened a CR to update your ssh key - if you can confirm it's correct and +1 the CR, I'll merge it.

Thu, Mar 9, 9:36 AM · SecTeam-Processed, Security, SRE-Access-Requests, SRE

Wed, Mar 8

MatthewVernon added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Hi @Jclark-ctr any news on getting these frontends ready for use, please?

Wed, Mar 8, 3:59 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon closed T331402: Grant Access to ldap/wmf for s-mukuti as Resolved.

@S_Mukuti this is all done for you now.

Wed, Mar 8, 12:28 PM · SRE, LDAP-Access-Requests
MatthewVernon added a member for WMF-NDA: S_Mukuti.
Wed, Mar 8, 12:28 PM
MatthewVernon added a comment to T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.

I see two successful PUTs for that object (one per DC), and indeed it seems to be successfully uploaded...

Wed, Mar 8, 12:24 PM · Wikimedia-production-error, SRE-swift-storage, Commons
MatthewVernon added a member for WMF-NDA: lwatson.
Wed, Mar 8, 12:03 PM
MatthewVernon added a comment to T331402: Grant Access to ldap/wmf for s-mukuti.

...though there is a "Sarah Mukuti" wikitech user, which I think is correct?

Wed, Mar 8, 11:21 AM · SRE, LDAP-Access-Requests
MatthewVernon updated the task description for T331500: Requesting access to analytics-privatedata-users for David Martin.
Wed, Mar 8, 11:18 AM · SRE, SRE-Access-Requests
MatthewVernon closed T331370: Request access to the group ldap/wmf as Resolved.

@lwatson this is now done.

Wed, Mar 8, 11:14 AM · SRE, LDAP-Access-Requests
MatthewVernon moved T331402: Grant Access to ldap/wmf for s-mukuti from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Wed, Mar 8, 10:28 AM · SRE, LDAP-Access-Requests
MatthewVernon added a comment to T331402: Grant Access to ldap/wmf for s-mukuti.

@S_Mukuti I think this is a request to join the wmf LDAP group only?
Also, can you double-check the wikitech username, please? I can't find an account by that name.

Wed, Mar 8, 10:27 AM · SRE, LDAP-Access-Requests
MatthewVernon moved T331482: Grant Access to analytics_privatedata_users for FNavas-foundation from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Wed, Mar 8, 10:23 AM · SRE-Access-Requests, SRE
MatthewVernon added a comment to T331482: Grant Access to analytics_privatedata_users for FNavas-foundation.

@FNavas-foundation can I double-check what access you need for what purposes, please? You say you need access to turnilo - that can be done with just wmf access (not analytics_privatedata_users) - see https://wikitech.wikimedia.org/wiki/Analytics/Data_access and/or talk to your manager.

Wed, Mar 8, 10:23 AM · SRE-Access-Requests, SRE
MatthewVernon moved T331500: Requesting access to analytics-privatedata-users for David Martin from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Wed, Mar 8, 10:16 AM · SRE, SRE-Access-Requests
MatthewVernon updated subscribers of T331500: Requesting access to analytics-privatedata-users for David Martin.

@DMartin-WMF can I confirm you don't require kerberos access (you didn't explicitly ask for it; cf https://wikitech.wikimedia.org/wiki/Analytics/Data_access )?

Wed, Mar 8, 10:16 AM · SRE, SRE-Access-Requests
MatthewVernon updated the task description for T331500: Requesting access to analytics-privatedata-users for David Martin.
Wed, Mar 8, 10:09 AM · SRE, SRE-Access-Requests
MatthewVernon added a comment to T331138: FileBackendMultiWrite multi-dc and thumbnail handling.

One further thought - it would be nice if we could take swift's special 404-handler out of the equation, and have our object store be more "vanilla" in operation[0]. So e.g. have thumbnail requests be served from thumbor (or equivalent), and have that service worry about the caching/invalidation/etc. This task might not be the place to address that, but it seemed worth flagging if we're going to expend effort on rethinking how thumbnails are dealt with.

Wed, Mar 8, 10:04 AM · SRE-swift-storage, MediaWiki-File-management
MatthewVernon added a comment to T331138: FileBackendMultiWrite multi-dc and thumbnail handling.

Also: if pre-generation of thumbs makes sense (does it? do we have any numbers on this stuff?) then it should happen in all datacenters, not just the one where swift is "master".

Wed, Mar 8, 10:01 AM · SRE-swift-storage, MediaWiki-File-management
MatthewVernon added a comment to T331138: FileBackendMultiWrite multi-dc and thumbnail handling.

From a Data Persistence POV, thumbs are ephemeral / cached - we are going to start expiring them at some point, and the large size of the thumb containers makes running e.g. rclone on them unattractive at best. I'm not very happy about the idea of starting to consider them Real Objects, if you see what I mean.

Even if they're ephemeral, which I think is an ok assumption (thumbnails are caching), you need to be able to consistently invalidate them when an image is deleted/replaced, which we don't do. I think that is more important than adding a TTL to this cache.

Wed, Mar 8, 9:59 AM · SRE-swift-storage, MediaWiki-File-management
MatthewVernon added a comment to T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.

That looks like a DNS error (unless I'm misreading net::ERR_NAME_NOT_RESOLVED), which wasn't apparent in the error that started this ticket.

Wed, Mar 8, 9:56 AM · Wikimedia-production-error, SRE-swift-storage, Commons

Tue, Mar 7

MatthewVernon added a comment to T326352: Q3:rack/setup/install ms-be207[0-3].

Yeah, I saw similar on ms-be2070; the problem being the disk isn't entirely blank. I suspect

sudo wipefs -a /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0-part1
sudo wipefs -a /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0

will do the trick.

Tue, Mar 7, 5:46 PM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon added a comment to T326352: Q3:rack/setup/install ms-be207[0-3].

...the icinga warning was systemd timing out waiting for smartd to start up (takes about 2 minutes).

Tue, Mar 7, 3:28 PM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon added a comment to T326352: Q3:rack/setup/install ms-be207[0-3].

@Papaul I've fixed the underlying problems and you'll see ms-be2070 reimaged to successful completion now, so hopefully that's you unblocked here.

Tue, Mar 7, 2:13 PM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon added a comment to T306098: Cloud VPS "swift" project Stretch deprecation.

Sorry, have updated that page now.

Tue, Mar 7, 2:11 PM · Cloud-VPS (Debian Stretch Deprecation)
MatthewVernon updated the task description for T329073: eqiad row A switches upgrade.
Tue, Mar 7, 1:59 PM · Patch-For-Review, Discovery-Search (Current work), Shared-Data-Infrastructure, Data-Engineering-Planning, DBA, SRE, Platform Engineering, Infrastructure-Foundations, Traffic, serviceops, Machine-Learning-Team, cloud-services-team, Data-Persistence, SRE Observability, serviceops-collab
MatthewVernon closed T331277: Requesting access to analytics-privatedata-users for Nicholas Ifeajika as Resolved.

@nickifeajika all done now.

Tue, Mar 7, 11:06 AM · SRE, SRE-Access-Requests
MatthewVernon added a member for WMF-NDA: nickifeajika.
Tue, Mar 7, 11:05 AM
MatthewVernon updated the task description for T331277: Requesting access to analytics-privatedata-users for Nicholas Ifeajika.
Tue, Mar 7, 10:02 AM · SRE, SRE-Access-Requests
MatthewVernon updated the task description for T331277: Requesting access to analytics-privatedata-users for Nicholas Ifeajika.
Tue, Mar 7, 9:34 AM · SRE, SRE-Access-Requests
MatthewVernon added a comment to T326352: Q3:rack/setup/install ms-be207[0-3].

d'oh, that seems likely, thank you!

Tue, Mar 7, 8:21 AM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops

Mon, Mar 6

MatthewVernon updated subscribers of T326352: Q3:rack/setup/install ms-be207[0-3].

@jbond I dunno if you have any thoughts about this? I've had a look at the iDRAC, and it has one of the SSDs as the boot device to try (which I'd expect to work), and all of the drives are set to non-RAID (and the convert-disks cookbook indeed says nothing to do). And it seemingly boots OK as a vanilla OS install.

Mon, Mar 6, 4:32 PM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon added a comment to T331277: Requesting access to analytics-privatedata-users for Nicholas Ifeajika.

@Miriam sorry, I forgot to ask: can I confirm that this is a time-limited account, and you are the contact regarding expiry, please? And can you tell me the expected expiry date, please?

Mon, Mar 6, 4:02 PM · SRE, SRE-Access-Requests
MatthewVernon updated the task description for T331277: Requesting access to analytics-privatedata-users for Nicholas Ifeajika.
Mon, Mar 6, 3:56 PM · SRE, SRE-Access-Requests
MatthewVernon moved T331277: Requesting access to analytics-privatedata-users for Nicholas Ifeajika from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Mon, Mar 6, 2:47 PM · SRE, SRE-Access-Requests
MatthewVernon updated the task description for T331277: Requesting access to analytics-privatedata-users for Nicholas Ifeajika.
Mon, Mar 6, 2:47 PM · SRE, SRE-Access-Requests