VRiley-WMF (Valerie Riley)
User

User Details

User Since
Aug 22 2023, 3:06 PM (35 w, 4 d)
Availability
Available
LDAP User
ValerieRiley
MediaWiki User
VRiley-WMF [ Global Accounts ]

Recent Activity

Wed, Apr 24

VRiley-WMF added a comment to T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005.

This was a duplicate ticket that was opened for https://phabricator.wikimedia.org/T360687

Wed, Apr 24, 7:58 PM · SRE, SRE Observability, ops-eqiad, DC-Ops
VRiley-WMF closed T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 as Resolved.
Wed, Apr 24, 7:57 PM · SRE, SRE Observability, ops-eqiad, DC-Ops
VRiley-WMF closed T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005, a subtask of T362989: Requests to prometheus pushgateway are timing out, as Resolved.
Wed, Apr 24, 7:57 PM · Observability-Metrics
VRiley-WMF claimed T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005.
Wed, Apr 24, 7:56 PM · SRE, SRE Observability, ops-eqiad, DC-Ops
VRiley-WMF changed the status of T361087: backup1005 crashed from Open to In Progress.
Wed, Apr 24, 4:49 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
VRiley-WMF added a comment to T361087: backup1005 crashed.

We have received the PERC from Dell and I have just completed swapping it out. The system can now see the PERC (previously, it could not). However, it does seem that the system will need to be rebuilt. @jcrespo would you be able to verify this? Thank you!

Wed, Apr 24, 4:08 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Tue, Apr 23

VRiley-WMF added a comment to T360687: Memory upgrade request for prometheus100[56].

B3 DIMM has been replaced and the server should be coming back online. Please check and confirm. Thank you!

Tue, Apr 23, 3:50 PM · SRE, ops-eqiad, Observability-Metrics

Mon, Apr 22

VRiley-WMF added a comment to T360687: Memory upgrade request for prometheus100[56].

Hey @herron, is there a specific time you'd like us to arrange for this activity? Let us know, thanks!

Mon, Apr 22, 10:49 PM · SRE, ops-eqiad, Observability-Metrics

Tue, Apr 16

VRiley-WMF added a comment to T361087: backup1005 crashed.

We have been able to get Dell support on this unit. After we sent over the logs and they reviewed them, they suggested updating the BIOS and iDRAC. The BIOS install went through fine. After completing the iDRAC update, it's not loading properly. We are currently working with Dell to resolve this new issue.

Tue, Apr 16, 10:09 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Thu, Apr 11

VRiley-WMF closed T362065: decommission dumpsdata1002.eqiad.wmnet as Resolved.

Unracked server and ran the script for decommission

Thu, Apr 11, 8:06 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF closed T362065: decommission dumpsdata1002.eqiad.wmnet, a subtask of T353787: Decom dumpsdata100[1-2], as Resolved.
Thu, Apr 11, 8:05 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Dumps-Generation
VRiley-WMF closed T362122: decommission wdqs1025.eqiad.wmnet as Resolved.

Removed server and ran decommission script

Thu, Apr 11, 6:10 PM · ops-eqiad, SRE, decommission-hardware
VRiley-WMF closed T362122: decommission wdqs1025.eqiad.wmnet, a subtask of T362080: decommission wdqs1025, as Resolved.
Thu, Apr 11, 6:09 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
VRiley-WMF updated the task description for T362122: decommission wdqs1025.eqiad.wmnet.
Thu, Apr 11, 6:09 PM · ops-eqiad, SRE, decommission-hardware
VRiley-WMF claimed T362122: decommission wdqs1025.eqiad.wmnet.
Thu, Apr 11, 6:04 PM · ops-eqiad, SRE, decommission-hardware
herron awarded T361251: titan100[12] ram/ssd upgrade coordination a Party Time token.
Thu, Apr 11, 4:48 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad
VRiley-WMF closed T361251: titan100[12] ram/ssd upgrade coordination as Resolved.

Worked with @herron and upgraded these servers. They came back up properly, so I am marking this ticket as resolved. Thanks!

Thu, Apr 11, 4:45 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad
VRiley-WMF updated the task description for T361251: titan100[12] ram/ssd upgrade coordination.
Thu, Apr 11, 4:44 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad

Wed, Apr 10

VRiley-WMF added a comment to T361251: titan100[12] ram/ssd upgrade coordination.

That works for me. I'll be there to assist with it. Thank you!

Wed, Apr 10, 3:00 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad

Tue, Apr 9

VRiley-WMF moved T361087: backup1005 crashed from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Tue, Apr 9, 8:39 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
VRiley-WMF added a comment to T361251: titan100[12] ram/ssd upgrade coordination.

We currently have all the parts needed for this upgrade. @fgiunchedi do you have an estimated time for these upgrades to take place? Please let us know and we'll schedule it. Thanks!

Tue, Apr 9, 5:44 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad
VRiley-WMF updated the task description for T361251: titan100[12] ram/ssd upgrade coordination.
Tue, Apr 9, 5:36 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad
VRiley-WMF claimed T361251: titan100[12] ram/ssd upgrade coordination.
Tue, Apr 9, 5:19 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad
VRiley-WMF closed T361968: db1246 crashed as Resolved.
Tue, Apr 9, 5:18 PM · SRE, ops-eqiad, DBA
VRiley-WMF added a comment to T361968: db1246 crashed.

Thank you! I will be closing this ticket.

Tue, Apr 9, 5:18 PM · SRE, ops-eqiad, DBA

Mon, Apr 8

VRiley-WMF closed T362064: decommission dumpsdata1001.eqiad.wmnet, a subtask of T353787: Decom dumpsdata100[1-2], as Resolved.
Mon, Apr 8, 6:15 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Dumps-Generation
VRiley-WMF closed T362064: decommission dumpsdata1001.eqiad.wmnet as Resolved.
Mon, Apr 8, 6:15 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF added a comment to T362064: decommission dumpsdata1001.eqiad.wmnet.

Completed decommission of this device.

Mon, Apr 8, 6:15 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF updated the task description for T362064: decommission dumpsdata1001.eqiad.wmnet.
Mon, Apr 8, 6:14 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF claimed T362064: decommission dumpsdata1001.eqiad.wmnet.
Mon, Apr 8, 6:04 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF added a comment to T361968: db1246 crashed.

Hi @Marostegui Thanks!

Mon, Apr 8, 5:36 PM · SRE, ops-eqiad, DBA
VRiley-WMF added a comment to T361968: db1246 crashed.

Dell has suggested the following

Mon, Apr 8, 2:52 PM · SRE, ops-eqiad, DBA

Fri, Apr 5

VRiley-WMF closed T360950: decommission logstash101[012] as Resolved.
Fri, Apr 5, 7:49 PM · SRE, ops-eqiad, Patch-For-Review, decommission-hardware
VRiley-WMF updated the task description for T360950: decommission logstash101[012].
Fri, Apr 5, 7:49 PM · SRE, ops-eqiad, Patch-For-Review, decommission-hardware
VRiley-WMF added a comment to T361968: db1246 crashed.

Sure thing, I will be reaching out to them.

Fri, Apr 5, 7:43 PM · SRE, ops-eqiad, DBA
VRiley-WMF moved T361968: db1246 crashed from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Fri, Apr 5, 7:38 PM · SRE, ops-eqiad, DBA
VRiley-WMF added a comment to T361968: db1246 crashed.

Hey @Marostegui I am currently looking at this unit. I checked both power supplies and logged into the machine. However, it seems to be healthy and doesn't seem to be reporting any errors at this time. Would you be able to confirm? Let me know, thanks!

Fri, Apr 5, 7:37 PM · SRE, ops-eqiad, DBA
VRiley-WMF claimed T361968: db1246 crashed.
Fri, Apr 5, 7:35 PM · SRE, ops-eqiad, DBA
VRiley-WMF moved T360950: decommission logstash101[012] from Backlog to Decommission on the ops-eqiad board.
Fri, Apr 5, 7:26 PM · SRE, ops-eqiad, Patch-For-Review, decommission-hardware
VRiley-WMF claimed T360950: decommission logstash101[012].
Fri, Apr 5, 7:26 PM · SRE, ops-eqiad, Patch-For-Review, decommission-hardware
VRiley-WMF closed T357093: Decommission puppetmaster1002, a subtask of T330490: Next steps for Puppet 7, as Resolved.
Fri, Apr 5, 6:38 PM · Puppet-Infrastructure, Puppet (Puppet 7.0), Patch-For-Review, Infrastructure-Foundations, SRE
VRiley-WMF closed T357093: Decommission puppetmaster1002 as Resolved.
Fri, Apr 5, 6:38 PM · Patch-For-Review, ops-eqiad, Puppet-Infrastructure, Infrastructure-Foundations, SRE
VRiley-WMF added a comment to T357093: Decommission puppetmaster1002.

This has been unracked and decommission script has been run.

Fri, Apr 5, 6:37 PM · Patch-For-Review, ops-eqiad, Puppet-Infrastructure, Infrastructure-Foundations, SRE
VRiley-WMF updated the task description for T357093: Decommission puppetmaster1002.
Fri, Apr 5, 6:37 PM · Patch-For-Review, ops-eqiad, Puppet-Infrastructure, Infrastructure-Foundations, SRE
VRiley-WMF claimed T357093: Decommission puppetmaster1002.
Fri, Apr 5, 6:34 PM · Patch-For-Review, ops-eqiad, Puppet-Infrastructure, Infrastructure-Foundations, SRE
VRiley-WMF closed T361372: decommission restbase10[19-27], a subtask of T354561: Hardware refresh: Decommission restbase10[19-27], as Resolved.
Fri, Apr 5, 6:30 PM · Patch-For-Review, Cassandra
VRiley-WMF closed T361372: decommission restbase10[19-27] as Resolved.
Fri, Apr 5, 6:30 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF added a comment to T361372: decommission restbase10[19-27].

These servers have been removed and the decomm script has been run.

Fri, Apr 5, 6:30 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF updated the task description for T361372: decommission restbase10[19-27].
Fri, Apr 5, 6:29 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF claimed T361372: decommission restbase10[19-27].
Fri, Apr 5, 6:18 PM · SRE, ops-eqiad, decommission-hardware

Wed, Apr 3

VRiley-WMF closed T360687: Memory upgrade request for prometheus100[56] as Resolved.
Wed, Apr 3, 5:24 PM · SRE, ops-eqiad, Observability-Metrics
VRiley-WMF added a comment to T360687: Memory upgrade request for prometheus100[56].

Worked with @herron and added the 32Gig DDR4 2666 to the requested slots. Both servers came back up and reported the correct sizes as expected. Closing this ticket.

Wed, Apr 3, 5:24 PM · SRE, ops-eqiad, Observability-Metrics
VRiley-WMF moved T360687: Memory upgrade request for prometheus100[56] from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Wed, Apr 3, 4:21 PM · SRE, ops-eqiad, Observability-Metrics

Tue, Apr 2

VRiley-WMF added a comment to T361087: backup1005 crashed.

Opened a ticket with Dell to see what they could assist with, since we first contacted them on the day the warranty expired. Awaiting a response from Dell.

Tue, Apr 2, 6:40 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
VRiley-WMF closed T361535: PDU sensor over limit as Resolved.
Tue, Apr 2, 6:39 PM · SRE, ops-eqiad
VRiley-WMF added a comment to T361535: PDU sensor over limit.

Rebalanced power cords.

Tue, Apr 2, 6:39 PM · SRE, ops-eqiad
VRiley-WMF claimed T361535: PDU sensor over limit.
Tue, Apr 2, 4:23 PM · SRE, ops-eqiad
VRiley-WMF closed T355353: Q3:rack/setup/install dbprov100[56] as Resolved.
Tue, Apr 2, 4:13 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
VRiley-WMF added a comment to T355353: Q3:rack/setup/install dbprov100[56].

This is now completed.

Tue, Apr 2, 4:13 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
VRiley-WMF updated the task description for T355353: Q3:rack/setup/install dbprov100[56].
Tue, Apr 2, 4:12 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops

Mar 28 2024

VRiley-WMF closed T358046: decommission cloudelastic100[1-4].wikimedia.org as Resolved.
Mar 28 2024, 5:53 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF added a comment to T358046: decommission cloudelastic100[1-4].wikimedia.org.

@RKemper Thanks for bringing this up! I missed running the script for this device. It's been run and decommissioned.

Mar 28 2024, 5:53 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF moved T361251: titan100[12] ram/ssd upgrade coordination from Racking Tasks to Hardware Failure / Troubleshoot on the ops-eqiad board.
Mar 28 2024, 4:18 PM · SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad

Mar 27 2024

VRiley-WMF claimed T361087: backup1005 crashed.
Mar 27 2024, 5:39 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Mar 26 2024

VRiley-WMF added a comment to T360687: Memory upgrade request for prometheus100[56].

@herron As it turns out, we currently don't have spare memory at 32Gig DDR4 3200. However, we have plenty of 32Gig DDR4 2666. Would this be an acceptable substitute? Let me know, thanks!

Mar 26 2024, 8:05 PM · SRE, ops-eqiad, Observability-Metrics
VRiley-WMF claimed T360687: Memory upgrade request for prometheus100[56].
Mar 26 2024, 8:02 PM · SRE, ops-eqiad, Observability-Metrics
VRiley-WMF closed T360722: PowerSupplyFailure as Resolved.
Mar 26 2024, 2:26 PM · SRE, ops-eqiad
VRiley-WMF added a comment to T360722: PowerSupplyFailure.

After inspection, I was unable to see which power supply was causing this issue. There was no indication while logged into the unit, and no LEDs indicating such a failure. Ran a firmware update for iDRAC and this has resolved the error. Closing this ticket.

Mar 26 2024, 2:25 PM · SRE, ops-eqiad

Mar 22 2024

VRiley-WMF claimed T360722: PowerSupplyFailure.
Mar 22 2024, 5:24 PM · SRE, ops-eqiad
VRiley-WMF closed T353845: decommission wdqs100[6-8] as Resolved.
Mar 22 2024, 5:23 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF closed T353845: decommission wdqs100[6-8], a subtask of T351671: Service implementation for wdqs10[17-21], as Resolved.
Mar 22 2024, 5:23 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31)
VRiley-WMF added a comment to T353845: decommission wdqs100[6-8].

This has been completed

Mar 22 2024, 5:23 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF updated the task description for T353845: decommission wdqs100[6-8].
Mar 22 2024, 5:23 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF updated the task description for T353845: decommission wdqs100[6-8].
Mar 22 2024, 5:19 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF updated the task description for T353845: decommission wdqs100[6-8].
Mar 22 2024, 5:14 PM · SRE, ops-eqiad, decommission-hardware
VRiley-WMF claimed T353845: decommission wdqs100[6-8].
Mar 22 2024, 4:22 PM · SRE, ops-eqiad, decommission-hardware

Mar 6 2024

VRiley-WMF added a comment to T358727: Reclaim recently-decommed CP host for WDQS (see T352253).

Swapped cable with a new one (same port), shut down the unit and reseated the drives as well. Powered the unit back on

Mar 6 2024, 6:33 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), Wikidata, wmde-wikidata-tech, SRE, ops-eqiad
VRiley-WMF added a comment to T358727: Reclaim recently-decommed CP host for WDQS (see T352253).

@dr0ptp4kt would you be able to try to reimage this unit again? I have run it through a power cycle, which can help with this process. Let us know, thanks!

Mar 6 2024, 5:35 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), Wikidata, wmde-wikidata-tech, SRE, ops-eqiad

Mar 5 2024

VRiley-WMF closed T358727: Reclaim recently-decommed CP host for WDQS (see T352253) as Resolved.
Mar 5 2024, 3:51 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), Wikidata, wmde-wikidata-tech, SRE, ops-eqiad
VRiley-WMF closed T358727: Reclaim recently-decommed CP host for WDQS (see T352253), a subtask of T336443: Investigate performance differences between wdqs2022 and older hosts, as Resolved.
Mar 5 2024, 3:51 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
VRiley-WMF closed T358727: Reclaim recently-decommed CP host for WDQS (see T352253), a subtask of T358533: Hardware requests for Search Platform FY2024-2025, as Resolved.
Mar 5 2024, 3:51 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
VRiley-WMF added a comment to T358727: Reclaim recently-decommed CP host for WDQS (see T352253).

Thank you @cmooney ! I have also relabeled this unit to match the name. Closing this ticket as per our discussion since it's completed from a DC Ops perspective.

Mar 5 2024, 3:50 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), Wikidata, wmde-wikidata-tech, SRE, ops-eqiad

Mar 4 2024

VRiley-WMF closed T359086: PowerSupplyFailure as Resolved.
Mar 4 2024, 6:24 PM · SRE, ops-eqiad
VRiley-WMF added a comment to T359086: PowerSupplyFailure.

Reseated the power supply cable. Monitored issue and the error has been resolved.

Mar 4 2024, 6:24 PM · SRE, ops-eqiad
VRiley-WMF claimed T359086: PowerSupplyFailure.
Mar 4 2024, 6:23 PM · SRE, ops-eqiad

Mar 1 2024

VRiley-WMF added a comment to T358727: Reclaim recently-decommed CP host for WDQS (see T352253).

Hi @dr0ptp4kt I have racked and stacked cp1086 in the following location

Mar 1 2024, 7:27 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), Wikidata, wmde-wikidata-tech, SRE, ops-eqiad
VRiley-WMF closed T358787: PowerSupplyFailure - an-coord1003 as Resolved.
Mar 1 2024, 7:03 PM · SRE, ops-eqiad
VRiley-WMF added a comment to T358787: PowerSupplyFailure - an-coord1003.

Swapped out power supply. It is back in operation.

Mar 1 2024, 7:03 PM · SRE, ops-eqiad
VRiley-WMF updated subscribers of T355353: Q3:rack/setup/install dbprov100[56].

@jcrespo We are at the point of imaging. Would you be able to assist with updating Puppet?

Mar 1 2024, 5:51 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
VRiley-WMF claimed T358727: Reclaim recently-decommed CP host for WDQS (see T352253).
Mar 1 2024, 4:43 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), Wikidata, wmde-wikidata-tech, SRE, ops-eqiad
VRiley-WMF moved T358787: PowerSupplyFailure - an-coord1003 from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Mar 1 2024, 4:31 PM · SRE, ops-eqiad
VRiley-WMF closed T356474: decommission db1118.eqiad.wmnet, a subtask of T326683: Decommission db1106-db1125, as Resolved.
Mar 1 2024, 4:29 PM · DBA
VRiley-WMF closed T356474: decommission db1118.eqiad.wmnet, a subtask of T341489: Create OTRS Database Snapshot, as Resolved.
Mar 1 2024, 4:29 PM · collaboration-services, DBA
VRiley-WMF closed T356474: decommission db1118.eqiad.wmnet as Resolved.
Mar 1 2024, 4:29 PM · SRE, ops-eqiad, DC-Ops, DBA, decommission-hardware
VRiley-WMF added a comment to T356474: decommission db1118.eqiad.wmnet.

Unracked and ran decommission script

Mar 1 2024, 4:29 PM · SRE, ops-eqiad, DC-Ops, DBA, decommission-hardware
VRiley-WMF claimed T356474: decommission db1118.eqiad.wmnet.
Mar 1 2024, 4:28 PM · SRE, ops-eqiad, DC-Ops, DBA, decommission-hardware
VRiley-WMF claimed T358787: PowerSupplyFailure - an-coord1003.
Mar 1 2024, 3:49 PM · SRE, ops-eqiad

Feb 28 2024

VRiley-WMF updated the task description for T355353: Q3:rack/setup/install dbprov100[56].
Feb 28 2024, 6:51 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
VRiley-WMF updated the task description for T355353: Q3:rack/setup/install dbprov100[56].
Feb 28 2024, 6:29 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
VRiley-WMF added a comment to T355353: Q3:rack/setup/install dbprov100[56].

dbprov1005
Rack A2
U 25
CableID 4905
Port 8

Feb 28 2024, 5:15 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops